Abstract


This  thesis  is  a study  of  speech  recognition  at  the  parametric  level.  It  attempts  to 
evaluate  and  understand  the  relative  merits  of  a number  of  alternative  design  choices  at 
that  level.  Such  a study  raises  issues  in  Artificial  Intelligence,  Linguistics,  Acoustics, 
Pattern Recognition, Statistics, and Speech Understanding research. In particular, it
involves an investigation of segmentation and labeling techniques, and the use of
parametric representations for the acoustic signal in those techniques. Every speech
recognition  system  employs  some  parametric  representation  and  some  initial  signal  to 
symbol  transformation.  We  show  the  performance  currently  available  for  these  initial 
processes,  and  assert  that  such  performance  is  comparable  to  human  performance.  We 
present  the  relative  merits  of  some  typical  parametric  representations,  and  develop  a 
methodology  for  such  comparative  evaluation.  Simple,  parameter-independent  schemes  for 
segmenting,  labeling,  and  training  are  developed  as  well.  The  role  of  pattern  classification 
techniques  is  clarified,  as  it  relates  to  the  initial  signal  to  symbol  transformation. 


Four parametric representations have been chosen for study: a set of amplitudes
and zero-crossing measurements from 5 octave filters (ZCC); a set of energy
measurements  from  a 1/3  octave  filter  bank  (ASA);  a smoothed,  short-time  spectrum 
computed  from  the  LPC  filter  (SPG);  and  the  LPC  coefficients  themselves  (ACS).  Note  that 
the  first  two  involve  the  use  of  analog  devices.  Each  method  yields  a set  of  measurements 
at  uniform,  short  intervals  — a pattern.  Distance  functions,  chosen  from  Pattern 
Classification  theory,  are  then  applied  to  the  parameter  patterns  as  measures  of  acoustic 
similarity. 

A method  for  segmenting  speech  into  isolated,  acoustically  consistent  segments  is 
presented.  The  method  is  fairly  independent  of  the  choice  of  parametric  representation, 
since  it  relies  upon  the  acoustic  similarity  measure  as  the  primary  evidence  of  acoustic 
change. Missing and extra segment error rates are found to be as low as 4% and 19%,
respectively. Significant differences in the segmentation effectiveness of the parametric
representations are found. They may be ordered as follows: SPG, ACS, ASA, and ZCC. The
best  performance  is  found  to  be  comparable  to  the  state  of  the  art.  Little  reduction  in 
accuracy  is  encountered  when  new  speakers  are  tested. 

Labeling  is  accomplished  by  the  same  pattern  similarity  measures.  However, 
similarity  is  measured  between  the  unknown  pattern  and  each  of  a set  of  stored  templates. 
A clustering  algorithm  is  presented  which  finds  the  most  suitable  set  of  templates  to 
represent a population of patterns which correspond to a particular phonetic label. The
patterns tested are those isolated by the best machine segmentation routine, hand
corrected  for  serious  errors. 

Little  difference  is  observed  along  the  parametric  representation  or  the 
classification  metric  dimensions,  except  for  poorer  performance  for  ZCC  input.  Each  input 
segment is labeled as one of a set of 40 phone labels. The correct phone appears as the
first choice 28% of the time. It appears in the first three choices 55% of the time.
However, when a lower level, acoustic transcription is used as evaluation referent, these
values increase to 42% and 65%. Even the 28% accuracy, which arises from a comparison
against phonemic expectation, is acceptable performance. It is the same as or slightly
better than human spectrogram reading performance in the absence of other linguistic
clues. 

The  major  contributions  are  as  follows.  1)  Simple  yet  effective,  parameter- 
independent  procedures  for  segmenting  and  labeling  speech  are  developed.  2)  A 
methodology  for  performance  evaluation  at  this  level  is  presented.  3)  A number  of 
alternative design choices are examined. 4) A better understanding is offered of the role
of pattern classification techniques in the initial signal-to-symbol analyses.
Chapter 1

Background  and  Problem  Statement 




In recent years, a renewed attack has been made on the problem of input of human
speech to computers. [New71,Red75b] This dissertation is particularly concerned with one
component of this problem -- the initial analysis of the acoustic input. A great deal of our
understanding of this problem has come from areas such as linguistics, physiology,
acoustics, and psychology. Computer science, and in particular artificial intelligence, has
played a catalytic role in drawing together knowledge from diverse sources into workable
structures. Common to all these structures is a component which deals with the acoustic
input in some parametric form. From that component we expect an initial isolation or
identification of the information borne by the acoustic signal. In this thesis we focus on
this essential element, its inherent problems, the issues involved in its implementation, and
its role in a total system.

1.1  Introduction 

The  basic  vehicle  for  this  research  is  the  problem  of  choosing  a parametric 
representation  for  the  acoustic  signal  which  is  to  be  input  to  a speech  understanding 
system. The choice must ultimately be made by the individual system designer, for there
is, as yet, no one clearly superior parametric representation that serves the variety of
purposes of segmentation, phonetic analysis, prosodics, etc. which are needed to
understand  general  continuous  speech.  Up  to  this  point,  the  prospective  system  builder 
has  made  the  choice  in  an  ad  hoc  manner.  Either  certain  hardware  was  already  available, 
or  the  necessities  of  cost  and/or  time  prevailed.  In  other  cases,  representations  were 
based  upon  traditional  methods.  In  those  cases  where  a parametric  representation  was 
developed  from  first  principles,  those  principles  have  consisted  of  limited  empirical  studies, 
often influenced by the element of human speech understanding ability, or they have been
based  upon  simplified  assumptions  about  the  physical  or  stochastic  nature  of  human 
speech. In short, we are faced with a number of different methods for extracting acoustic
parameters from the speech signal. All are based upon reasonable, but not complete,
understanding of the nature of the speech signal. Some make trade-offs with speed and
cost which may not be suitable or necessary. Many have been employed in speech
understanding systems of varying complexity and success. Some can be shown to support
re-synthesis of speech. But very few have been comparatively examined in the light of
their eventual use in a total system. (See [Fla72] for a survey of speech analysis and
synthesis  techniques.) 

In order to make the comparisons so that they will be useful to the speech system
designer, three problems must be considered. 1) The role of the acoustic information and
knowledge about acoustic-phonetics, in the context of the entire system, should be
understood. 2) The method by which the acoustic parameters are analyzed -- the
recognition scheme -- should be chosen with care. 3) The performance statistics must be
designed to convey sufficient information about the abilities of a parametric representation
to support recognition. The information is needed by the designer to predict what the
choice  will  mean  in  terms  of  his  system. 

This  chapter  is  a statement  of  the  problem  to  be  attacked.  As  such,  it  must  survey 
the  terrain  before  proceeding.  In  the  following  section,  we  will  discuss  those  aspects  of 
speech  understanding  systems  which  seem  to  be  relevant  to  the  question  of  system  use  of 
the acoustic parameters. Section 1.3 is a look at the uses to which the acoustic knowledge
itself is put -- what kind of processing will be needed depends upon what kind of
information about an utterance is required at the acoustic level. Section 1.4 is a survey of
the available methods for extracting parametric representations of a speech signal. And
the  final  section  states  the  specific  problem  in  terms  of  the  limitations,  assumptions,  and 
performance  dimensions  chosen  for  study. 

In succeeding chapters, we will present a very brief survey of pattern classification
ideas and methods (chapter 2), since these concepts are so basic to the type of analysis
done at the parametric level of speech understanding systems. Chapter 3 will discuss
aspects of the pattern classification problem particularly relevant to speech recognition.
In  addition,  a brief  survey  of  the  acoustic/phonetic  processing  of  a number  of  current 
systems  is  included  in  chapter  4 to  provide  some  context  for  this  work  and  for  other 
results  in  the  area,  as  well  as  to  provide  some  idea  of  the  currently  available  technology 
and  performance. 

Chapters 5 and 7 will present and discuss the methods for segmentation and
labeling, respectively, used in this research. Chapters 6 and 8 will present the
methodology for evaluation of performance and the results obtained. Finally, chapter 9 is a
concluding  discussion  which  will  serve  to  focus  attention  on  the  most  important  elements 
of  this  work,  and  will  provide  an  appropriate  overall  view  for  evaluating  results  of 
research  at  the  level  of  acoustic-parametric  analysis. 

1.2  Speech  Understanding  Systems 

In  this  section,  we  will  discuss  Speech  Understanding  systems.  Speech 
Understanding involves the input of a speech utterance, the extraction of relevant
linguistic information from the acoustic input, and the decoding of that information into
some meaningful construct. A distinction is often made between Speech Recognition -- the
process of extracting information by the use of knowledge about speech -- and Speech
Understanding -- where knowledge about the meaning of the utterance may be used to
decode  it.  The  purpose  of  the  discussion  is  to  provide  enough  of  an  overall  picture  of 
these  systems  that  the  acoustic  analysis  problem  can  be  seen  in  perspective  to  the  total 
problem.  Since  there  is  little  difference  for  our  purposes  between  these  two  types  of 
speech  system,  we  will  use  the  terms  interchangeably. 

At first glance, the problem of understanding the role played in speech
understanding systems by acoustic parameters might seem to be insurmountable. Clearly,
different systems will use their acoustic knowledge sources differently. Their other parts
will interact with each other in very different fashions. Errors fatal to some systems might
be easily corrected by others. However, this apparent lack of any unifying model of a
speech understanding system is not total. One may assume some structure and limitations
for the purpose of studying systems to be developed now and in the near future. There
is no clear model of what a speech understander should look like except the human model,
which is not describable to any great extent as yet. The information we do have about
human speech is structured into well defined theories or levels, and this structure can tell
a lot about the form that speech systems will take and the role that acoustic (parametric)
knowledge and analysis will play in them. The variations among systems become, in this
view, more questions of degree than of essential differences. How much weight does one
give to semantically based inferences about the utterance? How powerful a model of the
speaker is available? And so on. The answers to such questions of relative merit of the various
types of knowledge about speech and speakers give flesh to the skeleton structure of
the different levels. Then a control structure for handling interactions among the levels is
imposed so that errors can be detected quickly, work can be shared and efficiently
performed, and the knowledge source most likely to succeed can be invoked in any
situation.

1.2.1  Sources  of  Knowledge 

In  their  report  on  speech  understanding  systems,  Newell  et  al  [New71]  point  out 
the  relevance  of  the  levels  commonly  accepted  by  linguists  and  phoneticians  to  questions 
of  system  structure  and  control.  It  is  important  to  note  that  every  system  developed  to 
date  has  a number  of  internal  representations  of  the  input  utterance.  These 
representations  correspond  to  the  levels  of  discourse  in  speech  science  such  as  the 
acoustic,  phonetic,  lexical,  syntactic,  and  semantic.  Working  at  various  levels  are  sources 
of  knowledge  about  speech  which  serve  to  translate  from  one  representation  to  another. 
In  those  processes,  such  recognition  activities  as  search,  classification,  error  correction, 
hypothesizing, and verifying may occur (see figure 1.1). A source of knowledge at the
word  level,  for  example,  may  initiate  a lexical  search  to  convert  a phonetic  sequence  into  a 
word.  Or  it  may  be  used  to  generate  a sequence  of  phones  to  be  verified  or  matched 
against  the  input  at  any  of  a number  of  lower  levels. 
Figure 1.1: Levels and Knowledge Sources in HSII


Using  the  common  data  representations  and  speech  knowledge  of  traditional  theory 
to  relate  speech  understanding  systems  to  one  another,  one  can  hope  to  draw  some 
conclusions (albeit general ones) about the role of acoustic knowledge sources and data in
an  entire  system.  People  often  conceptualize  the  structure  of  speech  understanding 
knowledge  application  as  one  of  a linear  flow  through  the  levels.  Either  bottom-up  or  top- 
down  strategies  of  search  allow  decisions  (and  errors)  to  be  transmitted  and  transduced 
through  the  levels  in  a rather  straightforward  manner.  However,  the  interactions  among 
levels  may,  in  general,  be  complex.  One  cannot  assume  any  particular  form  for  the  control 
flow  of  such  systems,  but  we  will  briefly  discuss  below  a number  of  forms  that  have  been 
applied  to  speech  understanding  systems. 

Two  data  representations  common  to  many  systems  are  the  acoustic  parameters  and 
a phonetic-like transcription. The knowledge sources that we are investigating in this
dissertation are partly responsible for translating from the former to the latter.† Some
limited word recognition systems have shown great success bypassing the phonetic
transcription and recognizing words directly from the acoustic input parameters. It is,
however,  generally  agreed  that  such  techniques  fail  with  connected  speech  for  a number 
of reasons. (For one, the lack of word boundaries will cause an exponential increase in
the size of the recognition pattern storage required.)

In most systems for understanding general continuous speech, the processes which
apply knowledge about the acoustic and phonetic nature of speech gestures to the task of
producing phonetic transcriptions of the signal play a very important role. Essential to
this task is some form of classification scheme and some process for segmenting,
regardless of the manner in which these two processes interact. Segmentation may
precede and be independent of classification (labeling). A label may be chosen at regular
short intervals and segmentation proceed on the resultant string. Or the two processes
may operate on the same data and interact to support or reject each other's decision. In
any case, a parametric representation which does not reflect a particular acoustic cue of
segment boundary will produce segmentation errors, and one which maps different
acoustic realizations of phones into the same parameters will produce labeling errors.
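
The two orderings described above can be illustrated with a minimal sketch. It is not the procedure developed in this thesis: the function names, the 10 ms frame interval, the Euclidean distance, and the threshold are all assumptions chosen for illustration. The first function segments a string of per-frame labels by collapsing runs of identical labels; the second places boundaries wherever adjacent parameter frames differ by more than a threshold, without any labeling.

```python
import math
from itertools import groupby


def segments_from_frame_labels(frame_labels, frame_ms=10):
    """Labeling first: collapse runs of identical frame labels into
    (label, start_ms, end_ms) segments. frame_ms is an assumed interval."""
    segments = []
    start = 0
    for label, run in groupby(frame_labels):
        length = sum(1 for _ in run)
        segments.append((label, start * frame_ms, (start + length) * frame_ms))
        start += length
    return segments


def boundaries_from_distance(frames, threshold):
    """Segmentation first: return the indices of frames whose Euclidean
    distance from the preceding frame exceeds an assumed threshold."""
    return [
        i
        for i in range(1, len(frames))
        if math.dist(frames[i - 1], frames[i]) > threshold
    ]


# Hypothetical per-frame labels and two-dimensional parameter frames.
labels = ["s", "s", "s", "i", "i", "i", "i", "t", "t"]
print(segments_from_frame_labels(labels))

frames = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.0, 5.1)]
print(boundaries_from_distance(frames, threshold=1.0))
```

In either ordering, a representation that does not change across a true boundary yields no boundary in the second function, and one that maps different sounds to the same frame values yields a single merged run in the first.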

1.2.2  Some  Control  Structures 

Although  there  is  no  one  structure  for  interactions  among  the  knowledge  sources  of 
a speech understanding system, there are a few paradigms of such interactions which have
been proposed and applied to working systems. All of these paradigms deal with
information about the input utterance represented internally at a number of levels in some
incomplete, possibly errorful, data structure.

† Other sources of knowledge, concerned with phonetics, coarticulation, and stress for
example, are needed to deal with truly general speech. To deal with this straightforward
translation, it appears that classification based on acoustic patterns alone is not powerful
enough.

Systems organized to interact in a linear manner tend to be susceptible to error
propagation through the levels. However, subsystems of a number of speech recognizers
do obey linear control flow where a sequence of separate sources of knowledge each act
upon the previous one's output and feedback is initiated only from certain levels. An
example is in Hearsay I where broad classification, segmentation, fine classification, and
lexical search are linearly invoked, but feedback only results from higher levels initiating a
different lexical search. [Erm74b] This is a case where everything that can be done in the
general area of acoustic analysis of the utterance is done immediately. Thus, there is no
purpose to invoking any inter-level paths other than the straightforward one that reduces
the representation to the highest level data structure used in the system.

An early paradigm for speech recognition, suggested by Halle and Stevens [Hal62],
is Analysis-by-Synthesis. A representation of the input is postulated at some level and
the sources of knowledge are used to create a corresponding representation at another
lower level to be compared with the input. Some measures of closeness of the two
representations at the lower level are used to decide upon the "truth" of the higher level
assertion. Again, a linear system structure is likely to be used here since the point at
which feedback is initiated is at the low level comparison, after a sequence of
transformations of the represented synthetic utterance. Analysis-by-Synthesis can also
be applied in subsystems where the rules are available in a powerful but generative form,
and the size of the search for the correct representation to synthesize is not excessively
large. [Kla75]

The  Hearsay  system  paradigm  of  Hypothesize  and  Test  [Red73]  is  similar  to,  but 
more  general  than,  Analysis-by-Synthesis.  The  test  need  not  be  a comparison  of  two 
structures  at  the  same  level.  In  fact,  the  test  will  most  often  be  constructed  to  compare 
only those parts of the representation which may feasibly differ in a teleological sense (in
the sense that they might lead to different results at the higher levels). Flow of control
among the levels is much less constrained, and consequently the interactions are more
complex. 

Various parts of speech understanding systems may be treated as heuristic searches
in the sense that a universe of feasible solutions (interpretations for the input
representation to the subsystem) is being searched by application of specialized rules
dependent upon the current point in the universe being investigated. Knowledge sources
that allow representations at one level to be recognized only as legal representations at
another higher level are the operators that traverse the universe of solutions. Heuristics
for applying the operators may be explicit in some scoring mechanism or implicit in the
knowledge sources. (E.g., when a key syntactic element is discovered, it is reasonable to
generate the surrounding modifiers or function words.)

Dynamic  programming  techniques  have  been  successfully  applied  to  simple,  powerful 
systems  for  word  or  short  phrase  recognition.  [Ita75,  Fu68,  Ich73,  Whl75]  Usually  a single 
source of knowledge -- an acoustic classifying scheme -- is used within the dynamic
programming algorithm to find the best fit among a number of stored templates. The
dynamic program provides the ability to adjust time durations of the various segments to a
limited  degree  without  explicitly  segmenting.  This  is  a very  powerful  technique  for  short 
utterances  from  a limited  set  and  may  be  used  as  a component  within  a speech 
understanding  system. 
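
As a rough illustration of how a dynamic program can absorb differences in segment duration without explicit segmentation, the following sketch aligns an unknown sequence of parameter frames against stored templates. It is a textbook unconstrained time-warping recurrence, not the algorithm of any of the systems cited above; the Euclidean frame distance, the function names, and the template words are assumptions.

```python
import math


def dtw_distance(a, b):
    """Accumulated frame distance along the best monotonic alignment of
    sequences a and b (each a list of equal-length parameter vectors)."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = best accumulated distance aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(a[i - 1], b[j - 1])
            # A frame may be stretched (repeat a or b) or matched one to one.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def best_template(unknown, templates):
    """Name of the stored template with the smallest warped distance."""
    return min(templates, key=lambda name: dtw_distance(unknown, templates[name]))


# Hypothetical one-dimensional templates; the unknown is a slower "yes".
templates = {"yes": [(0.0,), (1.0,), (2.0,)], "no": [(5.0,), (5.0,)]}
unknown = [(0.0,), (0.0,), (1.0,), (2.0,)]
print(best_template(unknown, templates))
```

The cost of the recurrence grows with the product of the two sequence lengths for every template, which is one reason the technique suits short utterances drawn from a limited set.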

Baker  [BakJK75b]  presents  the  Hidden  Markov  process  as  a model  for  recognition  at 
each  of  a number  of  levels,  implemented  as  a dynamic  program.  Flow  of  control  in  his 
system  is  handled  by  the  probabilistic  model  itself.  An  underlying  representation  of  each 
level  is  hypothesized  as  a Markov  sequence  which  best  fits  the  observed  representation. 
At  each  level,  elements  of  the  lower  level  representation  may  stand  for  realizations  of 
elements of the representation in view. These latter are connected in a standard Markov
chain. The probability of a realization is a combination of the underlying (higher level)
chain's probability and the individual realization probabilities. The translation of the
underlying sequence to that of a higher level is much simpler since it is more highly
constrained  than  the  observed  representation. 
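
The idea of finding the underlying Markov sequence that best fits an observed representation can be sketched with the standard Viterbi dynamic program for a single level. This is a generic illustration, not Baker's system; the two underlying states, the two observed labels, and every probability below are invented for the example. The score of a path combines the chain's transition probabilities with the individual realization probabilities, as the paragraph above describes.

```python
import math


def viterbi(observations, states, log_start, log_trans, log_emit):
    """Return (best state sequence, its log probability) for a hidden
    Markov chain, given log start, transition, and realization tables."""
    best = {s: log_start[s] + log_emit[s][observations[0]] for s in states}
    back = []
    for obs in observations[1:]:
        prev, best, pointers = best, {}, {}
        for s in states:
            # Best predecessor of s under the chain's transition probabilities.
            p = max(states, key=lambda r: prev[r] + log_trans[r][s])
            best[s] = prev[p] + log_trans[p][s] + log_emit[s][obs]
            pointers[s] = p
        back.append(pointers)
    last = max(states, key=lambda s: best[s])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, best[last]


def logs(table):
    """Convert a nested dict of probabilities to log probabilities."""
    return {k: {j: math.log(p) for j, p in row.items()} for k, row in table.items()}


# All values below are hypothetical.
states = ["A", "B"]
log_start = {"A": math.log(0.9), "B": math.log(0.1)}
log_trans = logs({"A": {"A": 0.6, "B": 0.4}, "B": {"A": 0.1, "B": 0.9}})
log_emit = logs({"A": {"x": 0.9, "y": 0.1}, "B": {"x": 0.2, "y": 0.8}})

path, score = viterbi(["x", "x", "y", "y"], states, log_start, log_trans, log_emit)
print(path, math.exp(score))
```

Log probabilities are used so that long sequences do not underflow; the recovered path is the hypothesized underlying sequence for the level in question.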

Our  purposes  in  briefly  discussing  these  models  of  system  interaction  are  twofold. 
First, one can see that, inherent in all the systems thus far developed, there is the action
of translating a piece of one level's data structure into that of another. At the acoustic
level, this almost always means some form of classification of a short interval of the
acoustic representation into one of a number of phone-like classes. Measuring the
performance of such an action for the different acoustic representation schemes will,
therefore, provide information relevant to the performance of the vast majority of speech
understanding systems. The second purpose is to point out the feasibility of using models
of performance of knowledge sources in an analysis of the entire system's performance.
Although  the  control  paradigms  affect  the  order  of  applying  the  different  knowledge 
sources  and  the  amounts  of  effort  wasted  on  false  paths  or  bad  hypotheses,  the 
progression  of  the  correct  representation  through  the  levels  is  universal.  Some  piece  of 
the input signal will have been transformed by a sequence of classifications into either a
phonetic sequence, a word, or a phrase element. In continuous speech systems, further
transformations will have eventually carried these elements to a single semantic or task
related  construct.  While  the  entire  system  analysis  may  require  simulation,  if  no  analytic 
model  is  available,  the  individual  knowledge  sources  are  separable  and  their  effects  on 
system  performance  are  separable. 

1.2.3  Human  Performance 

A great  deal  has  been  written  about  all  aspects  of  human  perception  of  speech,  and 
we cannot even survey what is known or postulated about the structure and interactions
of knowledge within the human speech understanding system. However, the existence of
human speech perception under all manner of difficulties and limitations does point to
ways of analyzing individual knowledge sources for their role in the total picture.

Experiments  in  perception  of  words  under  noisy  conditions  have  quantified  to  some 
extent  the  role  of  semantic  support  in  disambiguating  errorful  inputs.  [Bru56]  In  a like 
manner,  errors  in  perception  are  correlated  with  ungrammaticality  to  measure  the  role  of 
syntax.  An  experiment  involving  unfamiliar  languages  [Sho74a]  has  shown  some 
interesting  results  as  far  as  the  accuracy  of  human  phonetic  recognition  is  concerned.  In 
this  last,  expert  phoneticians  are  presented  with  utterances  in  a number  of  languages 
whose words, syntax, and phonology are totally unfamiliar (Turkish, Cantonese, Swedish,
etc.). They are asked to produce as accurate a phonetic transcription as possible from
listening to the recorded utterances or from observing graphical displays such as sound
spectrograms or oscillograms. Very briefly, using auditory input the subjects achieved
about 50% recognition at the phonetic level, with a choice of about 50 phone-like labels.
With oscillogram or spectrogram input only, accuracy was about 25%. The results indicate
that the acoustic knowledge source in human perception is not much better than the best
machine procedures currently available. The human perceiver is much more adaptable and
more robust over a wide range of conditions than the machine at this level. But it seems
entirely  likely  that  present  techniques  could,  under  favorable  conditions,  perform  the 
foreign  language  experiment  as  well  as  the  human  subjects. 

There  is  disagreement  on  whether  higher  level  knowledge  or  low  level  recognition 
techniques are the bottleneck at this point. It is our opinion that there is much more to
be gained from improvements to higher level knowledge sources. This does not stop us from
continuing to improve the acoustic level procedures available, until they are as good as or
better than human ability, but it does point out the need for a clear understanding of their
performance characteristics. With such an understanding, system design efforts may be
best directed, and the results of improved higher levels will be recognized.

1.2.4  Summary 

We  have  presented  a picture  of  speech  understanding  systems  as  collections  of 
separable sources of knowledge, with representations of the speech signal occurring at a
variety of levels. The manner of interaction among these knowledge sources is of varying
importance in analyzing their performance. Our view is that the acoustic level processes
are  particularly  easy  to  separate. 
1.3 Acoustic Level

This  section  will  discuss  the  role  that  acoustic  knowledge  can  play  in  a speech 
understanding  system  and  the  types  of  decisions  that  can  be  supported  by  it.  The 
parametric  representation  of  the  utterance  to  be  understood  may  be  considered  as  the 
raw input of the system. Most higher level knowledge is not expressed in terms of this
representation. For this reason as well as the quantity of data that is input, some serious
reduction of the amount of data and some translation into another representation are the
primary requirements of this level. In addition, the system needs a reasonably powerful
way to begin its search for a solution. In some situations, semantics or syntax may be
able to provide such a handle, but often one must rely upon the acoustic input to make an
initial hypothesis from which the rest of the system may proceed. These three actions --
data reduction, translation, and hypothesis generation -- are the most common uses for
acoustic level analysis in speech understanding systems.

The  two  types  of  processing  that  are  typically  applied  are  segmentation  of  the 
utterance into quasi-phonetic segments and labeling of those segments with information
interpretable by higher levels -- usually identifying phone-like sounds. Although the
production of an actual phonetic transcription might involve a number of sources of
knowledge concerned with coarticulation, phonetics, prosodics, etc., an initial translation
into a sequence of acoustically separate segments and their classification into types of
speech sounds can provide a reasonable first approximation at a transcription. [GolH7fl] It
is our contention that a simple segmentation and labeling scheme can be used in this
comparison study. That is not to say that the limits of acoustic knowledge sources are
such simple schemes, but rather that these two basic processes are elementary processes
that more complex algorithms will depend upon. It is also an assertion that the primary
role of acoustic level analysis is satisfied by these two processes. The following brief
discussions should give a better idea of both the kind of processing to be done with
parametric representations of speech and the roles that the results of such processing
play in the whole system.
1.3.1 Segmentation

The  segmentation  process  is  conceptually  simple  --  to  find  the  boundaries  in  time 
between the different sounds that make up an utterance. The difficulty seems to lie in
defining  what  is  meant  by  "different  sounds".  In  a phonetic  or  phonemic  segmentation, 
some  segments  are  essentially  steady  state  in  their  acoustic  characteristics,  others  are 
continuously  varying  or  transitionary  in  nature,  and  some  are  composites  of  two  or  three 
sounds  of  either  type.  An  acoustic  segmentation,  on  the  other  hand,  separates  the  input 
into  portions  within  which  the  acoustic  character  is  consistent.  Transitionary  sounds  will 
still  present  a problem.  For  example  (see  figure  1.2),  the  sound  /I/  displays  a time  varying 
resonant  structure,  as  does  the  initial  portion  of  a vowel  following  a /g/  or  the  middle 
portion  of  a diphthong.  Yet  only  in  the  first  case  would  everyone  agree  that  a separate 
sound  must  be  identified  am^  set  apart  from  its  neighbors.  Clearly,  the  fineness  of 
resolution  to  which  one  requires  segmentation  be  done  depends  upon  the  final  uses  one 
has  for  a machine  transcription  of  the  utterance.  If  differentiation  of  words  is  done  by 
crude  identification  of  consonants  and  careful  analysis  of  the  most  stressed  vowel,  for 
example,  then  segmentation  should  be  biased  toward*  identifying  the  long  steady  state 
portions  as  single  segments,  even  at  the  cost  of  losing  some  consonant  segments.  If 
consonants  are  identified  by  their  coarticulative  effects  upon  neighboring  phones, 
transitionary  portions  become  very  crucial  and  must  be  located.  In  general,  the  commonly 
accepted  phonemes  of  English  (or  whatever  language  is  being  spoken)  give  an  idea  of  the 
degree  of  resolution  needed  for  most  analyses.  If  the  segmenter  can  separate  those 
portions  of  the  signal  most  likely  to  be  associated  with  the  phonemes  that  make  up  the 
utterance,  then  small  variations  in  how  diphthongs,  plosives,  etc.  are  treated  are  not 
critical.  If  the  speech  understanding  system  relies  upon  a set  of  labels  for  sounds  that  are 
considerably  different  from  the  phonemes,  the  segmentation  must  be  able  to  separate 
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those  sounds  robi  stly.^ 


Figure 1.2: Some Time-Varying Segments


1.3.2 Labeling

The labeling process is the central pattern recognition process at the acoustic level.
The parametric representation of an input segment of speech is labeled with an indicator
of the information it is deemed to be carrying. Until some such labeling is accomplished,
the sequence of segments may be any sequence at all. Thus 10 segments, each of which
may be any of 30 types of speech sound, represent 30^10 possible transcriptions of the
input. The labeling process, by reducing the 30 choices to, say, 3, can reduce the search
by a factor of 10^10. This possibility of reducing the exponential search size is due to the
fact that the acoustic labeling and segmenting are applied first, when little else is known
about the utterance, and that the vast majority of representations at this level are illegal
at higher levels and would never have been produced by the speaker in the first place.
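
The arithmetic behind this reduction can be checked directly. The following is a minimal sketch in Python; the variable names are illustrative only, and the figures are the ones quoted in the paragraph above (10 segments, 30 label types, 3 surviving choices per segment).

```python
# Search-size arithmetic for the labeling example (illustrative names only).
segments = 10           # segments in the utterance
label_types = 30        # possible speech-sound types per segment
kept_per_segment = 3    # choices left per segment after labeling ("say, 3")

unlabeled = label_types ** segments       # 30^10 candidate transcriptions
labeled = kept_per_segment ** segments    # 3^10 candidates after labeling
reduction = unlabeled // labeled          # (30 / 3)^10 = 10^10

print(unlabeled, labeled, reduction)
```

The reduction factor depends only on the ratio of choices per segment, raised to the number of segments, which is why even a coarse labeling pays off so heavily.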

† One segmentation process in Hearsay I picks out voiced, fricated, and silent segments
only. A later process may subdivide these segments upon more detailed analysis.

The issue of what labels are to be placed upon the input utterance is
an issue involving the design of a number of levels. Whether labels are to be considered
as distinct classes or as regions in a continuous space of labeling information is central to
the choice of whether to recognize phonetic features or phone-like gestures.
Segmentation may be accomplished by labeling at regular short intervals and then marking
boundaries at maximal changes in the labels. In such a situation, the label set must reflect
such a goal. The dynamic programming model that is used as a word recognition system
by Itakura and others labels an entire short utterance as a word. The primitive operation
in that case is a pattern recognition measure which determines how close a fit a short
interval in the input word makes with the stored template. Even in such a system, where
there is no actual phone-like labeling being done, the primitive action of comparing two
patterns for likely identity is basic. Chapter 2 will discuss the pattern classification model
and a number of methods for solving simple recognition problems within that general
model.
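
As an illustration of segmentation by frame labeling, the sketch below labels each frame with its nearest template and places a boundary wherever the label changes. It is a simplified stand-in, not the scheme developed in this work: the Euclidean distance, the two-parameter frames, and the template values are all assumptions made for the example, and a boundary is marked at every label change rather than only at maximal changes.

```python
import math

def nearest_label(frame, templates):
    """Return the label whose template is closest (Euclidean) to the frame."""
    return min(templates, key=lambda lab: math.dist(frame, templates[lab]))

def segment_by_labels(frames, templates):
    """Label every frame, then merge runs of equal labels into segments.

    Returns a list of (label, first_frame, last_frame) tuples.
    """
    segments = []
    for i, frame in enumerate(frames):
        lab = nearest_label(frame, templates)
        if segments and segments[-1][0] == lab:
            segments[-1] = (lab, segments[-1][1], i)   # extend the current run
        else:
            segments.append((lab, i, i))               # label changed: new segment
    return segments

# Hypothetical two-parameter templates and frames (one frame per 10 ms interval).
templates = {"vowel": (1.0, 0.1), "fricative": (0.2, 0.9), "silence": (0.0, 0.0)}
frames = [(0.0, 0.1), (0.1, 0.0), (0.9, 0.2), (1.0, 0.1),
          (0.8, 0.1), (0.3, 0.8), (0.2, 1.0)]
result = segment_by_labels(frames, templates)
print(result)
```

The same primitive, comparing a frame's parameters with a stored pattern, serves both to label and to segment, which is the point made in the paragraph above.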

1.3.3 Data Reduction

A typical digitized signal contains at least 10K samples per second, where each
sample should be at least 9 bits, probably more.† The parameters extracted from the
signal may reduce this data rate considerably. Spectrograms offer no reduction per se,
although the locations and amplitudes of spectral peaks (formant tracking) represent
approximately an order of magnitude saving. Typical analog filter banks, digitized every
10ms., offer the same order of magnitude reduction. However, one is still faced with
perhaps 10K bits per second, and only the most straightforward analysis can keep up with
such a data rate. Thus, an important role of the acoustic analysis level is the reduction of
the input data rate to an amount manageable by the higher levels, where interactions,
backtracking, and more complex analysis will preclude large, redundant data
representations. Merely labeling each 10ms. interval with one of a set of about 50 labels
reduces the rate to 600 bits/second. Further reduction is available by forming segments
with one label for a longer duration of signal (typically from 10 to 200ms., usually 50 to
100ms.). However, this latter saving may be spent on multiple labels, rating schemes, and
certain special parameters, such as overall amplitude, which may be useful to other
knowledge sources. The data in its new form is not only more compact, but also much less
redundant.

† In fact, 16-bit accuracy or a floating point scheme is needed. In dealing with 9-bit data,
our experience has been that not enough dynamic range is available. Either stressed
vowels are clipped, or unstressed nasals lack any waveform structure.
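
The data rates quoted in this section follow from simple arithmetic, sketched below in Python. The names are illustrative, and the rounding of the label width up to a whole number of bits is an assumption made here to arrive at the 600 bits/second figure.

```python
import math

sample_rate = 10_000       # samples per second ("at least 10K")
bits_per_sample = 9        # "at least 9 bits"
raw_rate = sample_rate * bits_per_sample               # bits/second of raw signal

frames_per_second = 1000 // 10                         # one label every 10 ms
label_set_size = 50
bits_per_label = math.ceil(math.log2(label_set_size))  # 6 bits distinguish 50 labels
labeled_rate = frames_per_second * bits_per_label      # bits/second after labeling

print(raw_rate, labeled_rate, raw_rate // labeled_rate)
```

Under these assumptions the labeled stream is 150 times smaller than the raw 9-bit signal, before any further saving from merging frames into segments.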

1.3.4 Translation

It has often been pointed out that a problem in applying much of the codified
knowledge about speech is that it exists in terms of generative rather than analytic rules.
However, another serious problem in applying such knowledge is that the rules are written
in terms of very different primitives. For example, syntax is often understood in terms of
lexemes -- words or endings of words; coarticulation rules are in terms of phones or other
perceptual features. The difficulty is that making a clear and universal correlation
between such elements and another representation, such as the acoustic parameters, is
not possible. (That is what speech recognition is all about.) Clearly, some initial translation
must be made from the acoustic parameters to some other representation better suited to
application of these rules. Most system designers have chosen the new representation to
be some form of phonetic label,† although this need not be the case. The new
representation may consist of entirely heuristic elements, pseudo-phonemes, or even, as in
some word recognition systems [Ich73, Ita75], entire words. The latter case is one where
no other knowledge is applied to the utterance except the acoustic matching in the context
of a dynamically adjusted time scale. The point to be made is that the role of acoustic
level translation is determined by the data structures of the other sources of knowledge.
Most systems have adopted a phonetic data representation at some level. Even if a
system has no such representation, translation still occurs in some form, from the
parametric representation to some other representation.

† The term "phonetic" carries implications of more human perception orientation than is
usually available. Indeed, one could argue that machine labels merely represent classes of
sounds with certain acoustic characteristics. They are no more phonetic or phonemic in
nature than any other sound. However, it is usually the goal in defining these classes to
pick sounds whose acoustic characteristics correlate highly with phonetic or even
phonemic information. In this sense, machine labels can carry both acoustic and phonetic
information.
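
Translation directly to whole words, by acoustic matching under a dynamically adjusted time scale, can be sketched with a plain dynamic-programming alignment. This is not Itakura's procedure, which uses an LPC-based distance and constrains the warping path; the absolute-difference distance, the one-dimensional frames, and the template names below are assumptions made purely for illustration.

```python
def dtw_distance(a, b, dist=lambda x, y: abs(x - y)):
    """Cost of the best monotonic time alignment between sequences a and b."""
    inf = float("inf")
    cost = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    cost[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            # Each step may stretch a, stretch b, or advance both together.
            best_prev = min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
            cost[i][j] = dist(a[i - 1], b[j - 1]) + best_prev
    return cost[len(a)][len(b)]

def recognize(frames, word_templates):
    """Label the whole input with the word whose template aligns most cheaply."""
    return min(word_templates, key=lambda w: dtw_distance(frames, word_templates[w]))

# Hypothetical one-parameter word templates and a slower rendition of one of them.
word_templates = {"rise": [1, 2, 3], "fall": [3, 2, 1]}
spoken = [1, 1, 2, 3, 3]
print(recognize(spoken, word_templates))
```

Here the "translation" is from a parameter sequence straight to a word label; no intermediate phonetic representation is involved, yet the pattern comparison primitive is the same.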

1.3.5  Hypothesis  Creation 

In a system which attempts to develop a partial representation of the utterance at
higher levels, the key to successful recognition is often the ability to create a "handle"
early in the process. Figure 1.3 shows an example of hypotheses created in Hearsay II.
Some phoneme, word, or phrase is recognized with high confidence, and the search spaces
of a number of different levels are significantly reduced. In addition, many rules of both
generative and analytic nature deal with elements in some limited context, so that
inference can only be made when some such context is available. It is, therefore, an
important role of the acoustic knowledge sources to provide initial hypotheses about the
utterance from which inferences may be carried forward, verified, or altered. Some
system structures, such as Analysis-by-Synthesis, do not proceed in this fashion. Rather,
the entire utterance is generated or stored as a template and a complete test is made.
Most implementations of such methods are restricted to particular levels with more flexible
overall control of the system; then the results of such tests are used on only limited
portions of the utterance. It is generally accepted that systems (in order to be robust in
the presence of errors) will require the ability to create hypothetical recognitions and to
alter them as new information is discovered. Therefore, the acoustic level results will
have to be viewed as an important source of such initial hypotheses or at least as the first
source of verification decisions. Issues such as how confident one can be in a particular
piece of the result, how often a really solid handle is found, and how errors will affect the
usefulness of the results as hypotheses for the rest of the system become important to
the analysis of performance and the prediction of merit to a working system of an acoustic
level recognition scheme.

1.3.6 Summary

We  have  seen  the  two  processes  of  Segmentation  and  Labeling  and  their  roles  in 
data  reduction,  translation,  and  hypothesis  creation.  Any  knowledge  source  which 
provides these functions from acoustic to higher level representations is a satisfactory
candidate  for  use  in  a speech  understanding  system. 

1.4  Parametric  Representations 

The  parametric  representation  of  the  acoustic  signal  is  the  basic  input  to  the  entire 
system.  The  choice  of  a good  method  for  representing  the  utterance  at  this  level  has 
been  the  subject  of  a great  deal  of  research,  conjecture,  and  rationalizing.  Even  though 
very  little  investigation  of  the  choice  itself  has  been  done  (see  Ichikawa  for  an  example 
[Ich73]),  a number  of  parameterizations  have  been  developed  from  theoretical  models  of 
the  vocal  tract,  from  experience  with  human  perception,  or  from  experience  with  heuristics 
found to be effective for machine recognition of speech. An extensive survey of all the
representations  for  speech  would  be  beyond  the  scope  of  this  dissertation,  both  because 
of  the  number  of  different  methods  (some  only  slightly  different  from  others)  and  because 
only  certain  representations  appear  to  be  useful  for  recognition.  Reasonably  current  and 
complete  surveys  are  available.  [SchRW75]  This  section  is  intended  to  be  more  a sketch  of 
the range of possible parameterizations, and a statement of the significant approaches that
have  been  taken  to  the  problem  of  designing  a representation,  than  a survey  of  the  field. 

1.4.1  Properties 

The  parametric  representation  should  have  certain  properties  in  order  to  be  useful 
to  a speech  understanding  system.  There  is  a clear  trade-off  among  cost,  either  of 
implementation  in  hardware  or  of  digital  computation  on  a general  purpose  machine,  and 
flexibility  and  small  data  rate.  However,  somewhere  between  representations  that  are 
very simple to extract (such as the digitized version of the signal) and representations
that are very flexible and parsimonious (an extreme example being a sub-phonemic
transcription),  is  the  parametrization  best  suited  to  each  system  and  its  resources.  In 
addition  to  properties  relating  to  cost,  size  of  representation,  and  flexibility,  the 
representation  should  be  robust  in  the  sense  of  causing  the  least  fatal  errors  possible. 
This  is  a teleological  property,  since  the  seriousness  of  errors  is  only  determined  after  the 
entire  system  is  applied  to  the  acoustic  parameters.  A major  result  of  this  research  is 
intended  to  be  a better  idea  of  the  relationship  of  phonetic  information  and  the  various 
parametric  representations  under  investigation.  In  one  sense,  much  of  this  question 
reduces  to  understanding  what  regions  of  the  space  of  representations  of  short  speech 
segments  correlate  well  with  useful  information  in  the  utterance,  and  what  regions  are 
likely to cause confusions because of their "nearness" in the space to very different
information  elements’  patterns.  In  short,  one  hopes  to  find  a representation  which 
preserves  the  acoustic  correlates  of  higher  level  information,  is  robust  in  those 
correlations,  reduces  redundant  information  in  the  signal,  and  is  reasonably  simple  to 
extract  from  the  raw  signal.  These  may  not  all  be  possible  at  one  time,  or  to  the  degree 
desired,  but  they  should  be  considered  in  selecting  a parametric  representation. 

1.4.2  Simple  Parameters 

Given the digitized version of an analog signal as input, there are a number of
simple yet powerful measurements which can be made on the signal. Within a short time
interval† where the signal is assumed stationary, the peak to peak amplitude, the positive
and negative peak amplitudes, the period between major peaks, and the number of zero
crossings in both or either direction may all be extracted. The pitch period, energy,
voiced-unvoiced feature, and the amount of high frequency micro-structure on the
waveform may all be estimated with these parameters. In particular, Baker [BaKJM75]
shows that a single event, the zero-up-crossing, when parametrized by the period
between events and the peak amplitudes in that period, gives very good information for
segmenting and identifying stop consonants. With other measures, such as a sine-fit to
measure micro-structure on the waveform, the parameters can support a general phonetic
segmentation and labeling scheme. Reddy [Red68] showed that simple measurements made
upon the signal and its high frequency component separately could alone support a
reasonable acoustic segmentation. Finally, even simpler measures can give useful
information. Schafer and Rabiner point out the usefulness of the deltas in adaptive delta
modulation schemes for detecting silence-speech boundaries. [SchRW75]

† The usual length of this interval is from 5ms. to 15ms. Clearly the longer the interval,
the greater the information reduction. Most speech gestures take longer than 10ms. to
complete. Only very short burst phenomena might be lost.
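
A few of these simple measurements can be sketched as follows. The function name, the dictionary keys, and the test signal (10 ms of a 500 Hz sine sampled at 10 kHz) are assumptions made for the example; it is not the measurement code used in any of the systems cited.

```python
import math

def simple_parameters(frame):
    """Peak-to-peak amplitude and zero-crossing counts for one short frame."""
    peak_to_peak = max(frame) - min(frame)
    up = down = 0
    for prev, cur in zip(frame, frame[1:]):
        if prev < 0 <= cur:
            up += 1            # zero-up-crossing
        elif prev >= 0 > cur:
            down += 1          # zero-down-crossing
    return {"peak_to_peak": peak_to_peak, "up": up, "down": down}

# Hypothetical test frame: 100 samples (10 ms at 10 kHz) of a 500 Hz sine.
# The half-sample offset keeps any sample from falling exactly on zero.
rate, freq = 10_000, 500
frame = [math.sin(2 * math.pi * freq * (n + 0.5) / rate) for n in range(100)]
params = simple_parameters(frame)
print(params)
```

Dividing the crossing count by the frame duration gives a rough frequency estimate, which is why zero-crossing counts are useful on band-limited signals.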

1.4.3  Spectral  Analysis 

A great  deal  of  phonetic  information  is  known  to  be  encoded  in  the  various 
frequency  components  of  the  speech  signal.  One  often  wants  to  separate  the  components 
of the signal according to their information content. This usually means, for speech, a
transformation into the frequency domain, or some separation of the various frequency
components  of  the  waveform. 

1.4.3.1 Filter Arrays

The simple measurements mentioned above may be coupled with pre-processing by
analog or digital filter arrays to produce a number of signals in parallel. Besides
straightforward signal enhancement by bandpass filtering to reduce AC line noise,
digitization aliasing, etc., there is bandpass filtering for the purpose of isolating separate
information-bearing elements of the acoustic signal. The number and bandwidth of these
filters is the subject of much discussion. How well do they correspond to the formants?
How costly is the array of filters to build and to digitize (in money and processing time)?
The Hearsay I system uses five bandpass filters of one octave width from 200 to 6400Hz.,
with peak to peak amplitude and zero crossing counts taken on each band and on the
unfiltered signal (see figure 1.4). These 12 parameters are extracted every 10ms. and used
in a simple pattern classification scheme for the basic acoustic level knowledge source. We
are presently experimenting with a set of 25 narrow bandpass filters which span the range
of 63Hz. to 16KHz. with ten filters per decade (figure 1.5). Many other researchers have used
arrays of filters in similar fashion to estimate a spectral analysis of the signal. These two
sets of filters are fairly representative of the types and numbers of such filter arrays in
current use.

Figure 1.5: ASA Parametric Representation
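
The two filter layouts can be written down numerically. The sketch below is an assumption-laden reconstruction: it takes the five octave bands to be contiguous starting at 200 Hz, and it takes the 25 narrow filters to be centred at 10^(n/10) Hz for n = 18 to 42, which is one layout consistent with "63Hz. to 16KHz. with ten filters per decade". The function names are hypothetical.

```python
def octave_bands(low_hz, count):
    """Edges of `count` contiguous one-octave bands starting at low_hz."""
    return [(low_hz * 2 ** k, low_hz * 2 ** (k + 1)) for k in range(count)]

def per_decade_centers(first_exp, last_exp, per_decade=10):
    """Centre frequencies 10**(n / per_decade) for n from first_exp to last_exp."""
    return [10 ** (n / per_decade) for n in range(first_exp, last_exp + 1)]

zcc_bands = octave_bands(200, 5)              # 200-400 Hz up to 3200-6400 Hz
zcc_parameter_count = 2 * (len(zcc_bands) + 1)  # two measures on 5 bands + unfiltered signal
asa_centers = per_decade_centers(18, 42)      # roughly 63 Hz up to roughly 16 kHz

print(zcc_bands)
print(zcc_parameter_count, len(asa_centers))
print(round(asa_centers[0]), round(asa_centers[-1]))
```

Under these assumptions the counts agree with the text: 12 parameters per frame for the octave-band scheme and 25 filters for the narrow-band scheme.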

1.4.3.2 Transforms

In addition to arrays of filters, whether analog or digital, the techniques of the Fast
Fourier Transform may be used to calculate the frequency domain transform of a digitized
speech signal quite efficiently. [Coc67] Schafer and Rabiner [SchRW75] give typical results
of FFTs of speech, and discuss the various parameters of the algorithm: length of window,
shape of windowing function, if any, the kind of frequency resolution obtainable, etc. The
short time spectrum may be used to detect pitch fairly well since peaks appear in the
spectrogram at harmonics of the fundamental pitch frequency. Other methods for pitch
detection are also derived from the spectrum, such as the harmonic product spectrum.
[SchRW75] A related method of analysis is sometimes called homomorphic filtering. The
problem is to separate two signals which have been combined by multiplication or
convolution. In speech processing, the central assumption is that the signal is such a
combination, namely the convolution of the excitation source with the vocal tract impulse
response. Without going into details [Opp68], the log of the magnitude of the Fourier
transform of the signal is the sum of the logs of the magnitudes of the transforms of the
two contributors. The inverse transform, being a linear operation, preserves the additive
combination in the result, known as the cepstrum. Because of this, the pitch signal, the
excitation source, may be separated out and analyzed. The vocal tract impulse response
may also be analyzed separately. This is accomplished by multiplying the cepstrum by a
"cepstrum window" that only passes short-time components.
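
The additive property that makes cepstral separation possible can be demonstrated on a toy signal. The sketch below is illustrative only: it uses a naive discrete Fourier transform in place of an FFT, a circular convolution so that the product rule holds exactly, and made-up "vocal tract" and "excitation" sequences. All function names are assumptions.

```python
import cmath
import math

def dft(x, inverse=False):
    """Naive O(N^2) discrete Fourier transform (an FFT would be used in practice)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [sum(x[t] * cmath.exp(sign * 2j * math.pi * k * t / n) for t in range(n))
           for k in range(n)]
    return [v / n for v in out] if inverse else out

def real_cepstrum(frame):
    """Inverse transform of the log magnitude spectrum."""
    log_mag = [math.log(abs(v) + 1e-12) for v in dft(frame)]  # offset avoids log(0)
    return [v.real for v in dft(log_mag, inverse=True)]

def circular_convolve(a, b):
    n = len(a)
    return [sum(a[m] * b[(t - m) % n] for m in range(n)) for t in range(n)]

def low_time_window(cepstrum, cutoff):
    """Keep cepstral values below `cutoff` (and their mirror images); zero the rest."""
    n = len(cepstrum)
    return [c if (i < cutoff or i > n - cutoff) else 0.0
            for i, c in enumerate(cepstrum)]

n = 32
tract = [0.5 ** t for t in range(n)]           # toy decaying "vocal tract" response
excitation = [0.0] * n
excitation[0], excitation[8] = 1.0, 0.5         # toy excitation repeating after 8 samples
signal = circular_convolve(tract, excitation)

c_signal = real_cepstrum(signal)
c_tract = real_cepstrum(tract)
c_excitation = real_cepstrum(excitation)
smoothed = low_time_window(c_signal, 8)         # passes only the short-time components
```

In this example the excitation contributes only at multiples of its period (index 8 and beyond), so the windowed cepstrum at indices 1 to 7 matches that of the toy tract response alone.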

1.4.4 Linear Prediction

A number of formulations of a method based upon the prediction of a sample as a
linear sum of the previous samples have been recently developed and fall under the term
Linear Prediction or Linear Predictive Coding (LPC). These formulations, all introduced to
the acoustic literature since 1966, represent a new application of a method in use by
statisticians and economists for a number of decades. However, the recent extensive work
in this direction has served to demonstrate the usefulness of Linear Prediction to the
analysis of speech -- particularly formant estimation -- and to provide the speech
community with a number of algorithmic methods and the body of theory to support their
use. As Schafer and Rabiner point out, the method is extremely powerful for the accuracy
of the estimated speech parameters it provides as well as for the speed of computation
possible.

1.4.4.1 Basic Method

The basic idea is that, within a short time interval (usually from 5 to 50 ms.) which is
assumed stationary, the samples of the digitized signal may be expressed as a linear
combination of the p preceding samples. The squared error is minimized and the least
squared optimal coefficients for this prediction are found by solution of a system of linear
equations.

Two formulations, which deal with slightly different treatments of the interval
boundaries, are known as the Covariance [Ata71] and the Autocorrelation [Mar72, Ita68]
methods. The Covariance method goes outside the interval for the p samples needed to
predict the first through pth samples, while the Autocorrelation method assumes zero
outside the interval. In the latter case, the interval must be windowed by a function that
goes to zero smoothly at the boundaries to avoid introducing the characteristics of a step
function. While the system of equations for the Covariance method is harder to solve,†
Atal has shown that it requires fewer samples to achieve similar accuracy. The saving in
terms of the cost of calculating over fewer samples may be significant. Neither method
seems clearly superior to the other. Besides the original papers, an extensive comparison
of those methods is available [Mak72] as well as shorter discussions. [SchRW75]
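
The autocorrelation formulation can be sketched in a few lines. The code below is a minimal illustration rather than the implementation used in this research: it computes autocorrelation values with zero assumed outside the frame and solves the resulting Toeplitz system by the Levinson recursion. The function names and the small numeric example are assumptions; windowing of the frame is omitted for brevity.

```python
def autocorrelation(frame, p):
    """r[k] = sum over n of frame[n] * frame[n + k], with zero assumed outside the frame."""
    n = len(frame)
    return [sum(frame[i] * frame[i + k] for i in range(n - k)) for k in range(p + 1)]

def levinson(r):
    """Solve the Toeplitz normal equations for the predictor coefficients.

    Returns (a, err): sample x[n] is predicted as sum of a[k-1] * x[n-k]
    for k = 1..p, and err is the remaining prediction error energy.
    """
    p = len(r) - 1
    a = [0.0] * p
    err = r[0]
    for i in range(p):
        acc = r[i + 1] - sum(a[j] * r[i - j] for j in range(i))
        k = acc / err                       # reflection coefficient for order i + 1
        new_a = a[:]
        new_a[i] = k
        for j in range(i):
            new_a[j] = a[j] - k * a[i - 1 - j]
        a = new_a
        err *= (1.0 - k * k)
    return a, err

# Small check against a system that can be solved by hand:
# [[2, 1], [1, 2]] * [a1, a2] = [1, 0.2]  gives  a1 = 0.6, a2 = -0.2.
coeffs, residual = levinson([2.0, 1.0, 0.2])
print(coeffs, residual)
print(autocorrelation([1, 2, 3], 2))
```

Each pass of the outer loop raises the predictor order by one using only the previous order's solution, which is what keeps the cost proportional to the square of p.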


† The covariance system is solvable by Cholesky decomposition, for example, with
approximately p^3 operations, while the form of the autocorrelation system is known as a
Toeplitz matrix and may be solved by Levinson's method in p^2 operations.

1.4.4.2 Parameters

There are a number of types of parameters derivable from the Linear Prediction
model. They all rely upon the same assumptions: stationarity over the interval, boundary
and window choices, and size, p, of the prediction equation. However, they represent very
different kinds of information about the speech signal.

The results of solving the linear equations are p parameters which are the
coefficients of the predictor, or, as sometimes formulated, the coefficients of an inverse
filter which can reduce the signal to noise. Itakura further processes them to remove any
correlation between the ith parameter and the remaining p-i parameters. These are called
the Partial Correlation Coefficients (Parcor) and have been shown to be an efficient
representation for analysis and re-synthesis of speech. [Ita70, Ita68]

In actual use for speech recognition, these parameters seem to be deficient or, at
best, not robust enough for simple classification algorithms. Ichikawa et al. [Ich73] point
out that the parcor parameters must be smoothed to achieve a reasonable recognition
performance, and they still are inferior to the spectrum envelope. However, Itakura
[Ita75] has developed a decision procedure from the probabilistic model of the signal used
in his LPC derivations, and has shown that the predictor coefficients can be used
effectively for recognition of speech.

By far, the most popular use of linear prediction is in producing estimates of the
short-time spectrum envelope. The Fourier transform (using a pruned FFT) of the linear
predictor impulse response, just the coefficients themselves, results in a smoothed
spectrum envelope of the vocal tract response with the effects of the excitation source
removed (see figure 1.6). It is, in fact, very similar to the results of cepstrum windowing.
These spectral estimates are quite accurate in locating the peak frequencies (a good guess
at the formants). These locations in frequency can be derived directly from the solution of
the filter transfer function, but the FFT is so fast, especially pruned for only p non-zero
inputs, that this is only useful when formant bandwidths are also desired.† Fant [Fan74]
has indicated that reasonable estimates of the formant amplitudes can be derived simply
from their frequencies. Hence, the frequencies themselves seem to be more important
information bearing parameters than amplitudes or bandwidths.

Figure 1.6: SPG Parameters for a 20ms. Window
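
How a spectral peak emerges from the predictor coefficients can be sketched as follows. The code evaluates the inverse filter directly on a grid of frequencies, which stands in for the pruned FFT mentioned above; the function name, the grid size, and the single-resonance coefficients (a pole pair at 1000 Hz with radius 0.95 at a 10 kHz sampling rate) are all assumptions for illustration.

```python
import cmath
import math

def lpc_envelope(a, n_points=256):
    """Magnitude of 1 / A(e^{jw}) at n_points frequencies from 0 up to half the sampling rate.

    `a` holds predictor coefficients, so A(z) = 1 - sum of a[k-1] * z^-k for k = 1..p.
    """
    envelope = []
    for i in range(n_points):
        w = math.pi * i / n_points
        inverse_filter = 1 - sum(ak * cmath.exp(-1j * w * (k + 1))
                                 for k, ak in enumerate(a))
        envelope.append(1.0 / abs(inverse_filter))
    return envelope

# Hypothetical two-coefficient predictor with one resonance near 1000 Hz.
rate, formant, radius = 10_000, 1000, 0.95
theta = 2 * math.pi * formant / rate
a = [2 * radius * math.cos(theta), -radius ** 2]

env = lpc_envelope(a)
peak_bin = max(range(len(env)), key=env.__getitem__)
peak_hz = peak_bin * (rate / 2) / len(env)
print(peak_hz)
```

Picking the largest value of the envelope locates the resonance to within the spacing of the frequency grid, which is the sense in which the peaks give a good guess at the formants.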

1.4.5  Summary 

There  is  a vast  range  of  possible  parametric  representations,  many  derived  from 
basic  methods  of  extracting  information  from  the  signal.  It  is  not  possible  to  survey  the 
entire field, but we have discussed the methods in common use at the present time. Figure
1.7 summarizes some aspects of the four parametric representations we have chosen.


† Accurate estimates of the formant bandwidths are available from the Covariance method
coefficients [Ata71].

Figure 1.7: Four Parametric Representations


1.5  Problem 

The  main  problem  to  which  this  research  is  directed  is  the  comparison  and 
evaluation  of  parametric  representations  for  speech  and  their  effects  upon  the 
performance  of  speech  recognition  schemes  at  the  acoustic  level.  Enough  background  has 
been  presented  now  to  discuss  limitations  upon  the  problem,  dimensions  of  the 
investigation,  and  goals  of  the  research.  The  task  implied  under  the  broad  statement 
above  is  beyond  the  scope  of  this  dissertation,  and  is,  in  view  of  the  lack  of  clearer  models 
of  the  entire  speech  understanding  process,  beyond  the  state  of  the  art  of  performance 
analysis.  Thus,  the  primary  message  of  this  section  is  how  we  may  limit  the  analysis  so 
that  the  results  will  be  meaningful,  useful,  and  extendable  to  specific  system  analyses. 

1.5.1  Limitations 

The first and foremost limitation is to consider only the acoustic level, and at that
level, to consider only sources of knowledge that do segmentation and labeling of the input
utterance into an acoustic-phonetic transcription. It is reasonable to make these
restrictions. The acoustic parameters are primarily input to this level only, although,
occasionally, knowledge about such aspects as prosodics will be employed by higher
levels. So the main effect of the parametric representation is felt through its effect on the
segmentation and labeling processes. Therefore, this effect can be understood to a large
degree if the interface between these processes and the rest of the system is understood.
That interface is best characterized as a machine transcription.

Second, the acoustic level processes will be measured as a separate sub-system,
with the interface strictly viewed as a transcription of the utterance with boundaries
marked in time and some encoding of the identity of each segment. In this way, the
interface is more clearly understood and available to analysis. A performance model can
be constructed that produces such transcriptions if the two processes of segmenting and
labeling are able to be modeled individually.

By way of describing the general aspects of the experimental set-up, the following
are relevant dimensions from the speech understanding system goals in the Study Group
report [New71]:

1) Continuous speech is to be used. The articulatory targets, and
hence, the resultant acoustic patterns, are much less well achieved in
continuous speech than in isolated words. The labeling errors are
considerably different therefore, and segmentation becomes harder as well.

2-5) Cooperative speakers will be used, recording over a high
quality microphone in a quiet room. The relaxing of these restrictions may
provoke errors, but it is likely that these errors will be predictable in nature
-- a general degradation due to many speakers, fricative confusions due to
loss of high frequency information, etc.

6-7) Tuning of the acoustic level knowledge will be in the form of
pre-testing training data. The training will be over each speaker's utterances,
although not the same utterances as used for testing, and thus will be tuned
to his voice.

8) The vocabulary will be chosen to include a wide range of contexts
for all the commonly occurring allophones of American English phonemes.
[Sho74b]


1.5.2 Performance Dimensions

There are essentially three dimensions to the investigation of the performance of
acoustic representations. The first, obviously, is the choice of the representation itself.
Here, the major task in defining the research is to isolate representative methods from
among the many possibilities.† Although the possible combinations of filter arrays,
waveform measurements, spectra, etc., are numerous, the representations that people have
chosen thus far seem to fall into a few general types. It is our intention to represent
those types according to currently available techniques. If someone invents a new
representation for speech, this research will be available to help place the new
representation into the total picture. The general types of parameters are: simple
measurements on arrays of filters to obtain rough spectral information or to separate
different information bearing parts of the signal, LPC parameters of various kinds to
parametrize a model of the waveform either acoustically or probabilistically, and spectral
envelope estimates that seek to characterize the vocal tract response separately from the
excitation source.

Estimating short-time spectra by the output of an array of bandpass filters is
represented by the Zero-Crossing Count (ZCC) parameters used in Hearsay I [Erm74b] and
the Audio Spectrum Analyzer (ASA) [Kri75]. The former consists of five broad bandpass
filters with both peak-to-peak amplitudes and zero-crossing counts to increase the ability
to estimate frequency information. The latter consists of 25 narrow bandpass filters
whose output energy is measured. The LPC method developed by Markel [Mar72] (the
autocorrelation method) is used to provide inverse filter coefficients and an estimate of
the spectral envelope (SPG) by use of an FFT algorithm. Itakura's log ratio measure
[Ita75] will be used in conjunction with the autocorrelation sequence (ACS), although this
representation will not be used with other classification metrics.

The second dimension concerns the particular algorithms used to perform the tasks
of labeling and segmentation. These will be based primarily upon the pattern classification
concept of a pattern space distance metric. Some traditional metrics employ first or
second moment statistical estimates of sample populations of patterns. Two specially


† We realize that, no matter which parametrizations are chosen, someone will be sure to
point out, "...yes, but if you use this different measurement, you can disambiguate those
phonetic classes..." The answer to such comments is usually a question: "At what cost, and
with what new errors introduced?"
designed metrics will also be used.†

The third dimension can be characterized by the issue of cost. It involves cost of
implementation, memory and processor requirements, and the effect of these demands
upon total system speed and size (the real-time question). As these issues are much
better understood, especially at this level where straightforward, uniform procedures are
usually employed, no attempt will be made to span this dimension with empirical results.

1.5.3  Goals 

Necessarily,  the  goals  of  this  research  are  limited  to  understanding  the  effects  of 
parametric  representations  on  acoustic  level  performance.  Central  to  that  understanding 
are  two  issues  which  may  be  taken  as  goals. 

1) The answers to designer-voiced questions should be available. They
are usually of the form, "How much can I get for a certain amount of resource
expended?" or "Will I be satisfied (i.e., will the system I am planning be able to
use the acoustic level information)?"

2) A methodology for testing and comparing these representations
should be available. New representations can thus be viewed in perspective.
Advances in the state of the art will be recognized and effort can be directed
more usefully. This requires a set of algorithms for parametric level
processing that are relatively independent of the choice of parametric
representation.

1.5.4  Summary 

In this section, we have attempted to define a region of the space of possible
performance experiments at the lowest level of speech recognition. The entire chapter
was aimed at fixing a point of view and a set of basic assumptions about speech
understanding systems, the parametric level of analysis, and performance evaluation goals.


† Baker's log probability estimate and Itakura's log probability ratio.
In that point of view, parametric analysis is the basic input for the lowest level of
recognition activity. That activity is primarily performed by segmenting and labeling
processes, which produce more manageable data for higher level knowledge sources.
When those knowledge sources use knowledge from such high levels as semantics or
pragmatics, we may truly call the system a speech understanding system. By carefully
evaluating the performance of the low level recognition processes, we may provide a firm
base for total system performance analysis. We have limited this research to a number of
the most commonly accepted methods for parametrizing the acoustic signal, and for doing
segmentation and labeling. The results and methodology thus provided will further our
understanding of many of the issues of speech recognition activity at the parametric level.
Chapter 2

Pattern  Classification  Techniques 

This chapter consists of a short survey of pattern classification and commonly
accepted techniques. (For further details see [Dud73, Nag68, Mei72, Fu68].) In chapter 3,
we will discuss the ideas from pattern classification theory chosen for this research, the
issues surrounding a choice of classes, and considerations for training and testing data
corpora.

2.1  Basic  Model 

Most pattern classification problems are concerned with classifying input patterns
into one of a finite number of classes. One approach to pattern classification is to keep a
representative (template) of each class, and to match the input against each using some
"closeness" measure. This has many shortcomings, not the least being the lack of a way of
defining a good template for the various occurrences of speech phenomena under different
conditions.† A more general model, for which template matching is a special case, is
usually presented. A series of measurements is made on the pattern, either in its
original physical form, or from some representation of it. These measurements should be
chosen for their invariance under the kinds of informationless perturbations expected and
for their dependence upon the classes sought (information content).

Assuming a reasonable set of m features is chosen, their values represent a pattern
vector in an m-dimensional feature space. The problem is then to provide a partitioning of
that space. (If continuous valued classifications are required, a mapping into the class
space is needed.)

A number  of  different  techniques  are  available  for  drawing  these  partitions.  Some, 


† The approach has been used for word identification in the Vicens system at a higher
level [Vic69]. First the word is segmented and the segments are classified, then the
duration-normalized sequence of labels is matched with stored templates for each word in
the lexicon.
by nature of their returning a decision value related to an estimate of the confidence or
closeness of class identity, can be used to provide continuous classification. Often,
however, these values have little meaning outside of their application in partitioning. One
usually assumes a single class identity or an ordered subset of the classes (perhaps with
estimates of goodness) is to be returned by the classifier.

Various aspects of the acquisition and refinement of these partitionings are of
importance. We will discuss the size of sample and test sets of identified patterns and
their relevance to the expected results of a method developed with such sets. Algorithms
for automatic learning are also available. In these, a teacher is sometimes postulated who
can provide feedback to re-adjust the partitioning rules in light of errors committed.
Often the set of classes is not known, and unlabeled samples may be partitioned by
optimizing various measures of clustering or separability.

By way of example, a simple pattern recognition scheme might work thus:

Collect, properly segment and label a set of sample patterns
(training set).

Average the feature measurements for each class.

For another set of labeled samples (test set) compute the
Euclidean distance to each class average from the input
features and assign the closest class as the input identity.

If the classification is wrong, adjust the correct class's average
towards the new input by 1/n of the distance (where n is
the number of training samples in that class). Also adjust
the other classes which were closer than the correct one
away by a similar fraction.

Obviously, a great many issues are untouched or oversimplified by this example.
But it does serve to point out a typical approach. We can easily show that the decision
boundaries thus drawn are linear. It has been shown that under certain conditions [Nag66,
Nag68] the adjustment described here converges. With some pre-processing for
normalization, this method can provide good results for well clustered classes.
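
The example scheme can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the feature vectors are plain lists, and "a similar fraction" for the competing classes is taken to be 1/n with n the size of that competing class.

```python
import math

def class_means(samples):
    """samples: dict label -> list of feature vectors. Returns (means, counts)."""
    means, counts = {}, {}
    for label, vecs in samples.items():
        means[label] = [sum(col) / len(vecs) for col in zip(*vecs)]
        counts[label] = len(vecs)
    return means, counts

def euclidean(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def classify(means, x):
    """Assign the class whose average is closest to x."""
    return min(means, key=lambda label: euclidean(means[label], x))

def correct(means, counts, x, true_label):
    """If x is misclassified, move the correct average toward x by 1/n and
    move every average that was closer than the correct one away from x."""
    guess = classify(means, x)
    if guess == true_label:
        return guess
    d_true = euclidean(means[true_label], x)
    for label, mean in list(means.items()):
        step = 1.0 / counts[label]  # assumption: each class uses its own n
        if label == true_label:
            means[label] = [m + step * (xi - m) for m, xi in zip(mean, x)]
        elif euclidean(mean, x) < d_true:
            means[label] = [m - step * (xi - m) for m, xi in zip(mean, x)]
    return guess
```

Because each input is assigned to the nearest average, the boundary between any two classes is the perpendicular bisector of the segment joining their averages, which is why the boundaries are linear.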

2.2  Stochastic  Patterns 

Implicit in almost every investigation of Pattern Recognition is the assumption that
non-deterministic (stochastic) processes are at work, adding noise and otherwise
transforming the original patterns. Let us model this process by asserting that each class
corresponds to a multivariate probability distribution in the feature space. If the set of
classes corresponds exactly with the information intended to be conveyed by the patterns,
this will be a good model. If not, there will be in the observed distributions effects of
other sub-class distributions† or of correlation between the classes (in effect, clustering
of the clusters)‡. However, we may take this model as a first approximation for speech,
although we must investigate the distributions carefully.

For  the  following  development,  let: 

$P_i$ be the a priori probability of an occurrence of a pattern in the
$i$th class,

$f_i$ be the probability density function for the $i$th class,

$x$ be the unknown pattern vector.

Then:

$$f_i(x) = \Pr\{x \mid \text{class } i\}$$
$$P_i\, f_i(x) = \Pr\{x, \text{class } i\}$$

Bayes' rule states that the largest expected rate of correct classification is attained by
classifying x in class i if $\Pr\{x,i\} \ge \Pr\{x,j\}$ for all $j \ne i$. Furthermore, we may define a loss
function L(u,v) as the cost of classifying an input in class u when it should be v. Then the
expected loss, or risk, of a classifying rule C(x) is:


† Multimodality of the cluster for a diphthong, or for vowels in different contexts.

‡ The broad classifications of vowel, nasal, fricative, etc. are much easier to effect than
more specific phonetic classes.
$$R = E\{L[C(x), i]\} = \sum_i \int L[C(x), i]\; P_i\, f_i(x)\; dx$$

If we wish to minimize this, then we must clearly minimize, for each x:

$$\sum_i L[c, i]\; P_i\, f_i(x)$$

where x is classified in class c.

Until more is known about the relationship between a particular speech
understanding system and the classifier it uses, we would assume the first case above,
which corresponds to a loss of 0 for correct and 1 for incorrect classification. The
successful application of Bayes' rule rests upon the availability of the underlying
probability distributions. However, they may be estimated parametrically if their forms are
known, or approximated by a number of techniques.
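
The correspondence between the 0/1 loss and the maximum joint probability rule can be written out in one step. With a loss of 0 for a correct and 1 for an incorrect classification, the quantity to be minimized when x is placed in class c becomes

$$\sum_i L[c,i]\, P_i f_i(x) \;=\; \sum_{i \ne c} P_i f_i(x) \;=\; \sum_i P_i f_i(x) \;-\; P_c f_c(x).$$

The first term on the right does not depend on c, so the sum is smallest for the class c with the largest $P_c f_c(x) = \Pr\{x, \text{class } c\}$, which is the rule stated earlier.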

2.3  Overview 

The methods for estimation of distributions, learning of parameters, and decision
boundary drawing may be placed into a few groups that will serve to clarify their
relationship to the basic model and to optimality as represented above.


If the forms of the distributions are available, we may seek to estimate them
parametrically by taking relevant statistics of the samples. For instance, if we have good
reason to believe that the features are independent variables and the clusters have
Normal distributions, the variances and means of the features will yield an optimal rule.
Normalize by the variances and decide upon the distance to the mean in the normalized
space. These are essentially spherical clusters.
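
A minimal sketch of such a rule follows. It assumes, beyond what the text states, that the feature variances are shared by all classes (pooled within-class variances) and that the classes are equally likely a priori; with class-specific variances or unequal priors, additional terms would enter the optimal rule. All names are hypothetical.

```python
def class_means(classes):
    """classes: dict label -> list of feature vectors. Returns label -> mean vector."""
    return {c: [sum(col) / len(vs) for col in zip(*vs)] for c, vs in classes.items()}

def pooled_variances(classes, means):
    """Within-class variance of each feature, pooled over all classes."""
    total = sum(len(vs) for vs in classes.values())
    m = len(next(iter(means.values())))
    return [
        sum((x[k] - means[c][k]) ** 2 for c, vs in classes.items() for x in vs) / total
        for k in range(m)
    ]

def decide(x, means, variances):
    """Choose the class whose mean is nearest in the variance-normalized space."""
    def dist2(c):
        return sum((xi - mu) ** 2 / s2 for xi, mu, s2 in zip(x, means[c], variances))
    return min(means, key=dist2)
```

Dividing each squared difference by the feature variance is what turns elongated clusters into roughly spherical ones, so a noisy feature counts for less than a reliable one.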

Where forms are not known, a number of methods are still available. The method of
Potential Functions [Ais64] forms the sum of a number of peak-like functions,† each placed
at a particular sample point in the class cluster. The amount of spread of each peak
determines the smoothness. Many heuristic methods may also be thought of in this light.
The kth-nearest neighbor method retains all the samples. The probability is essentially
estimated by the number of samples in a class that lie close to the unknown point.

Some decision rules may be thought of as ignoring the distributions and, rather,
seeking to find good separating boundaries directly. Forms are chosen, as in the cases of
linear or piecewise composite boundaries. Then parameters are estimated from the
samples. Equivalences between a number of methods can be shown theoretically.

Learning approaches seek to adjust the parameters of whatever methods are chosen
as new information about the patterns is obtained. Supervised learning can occur when a
correct label is available for the samples upon which learning takes place. When
completely unknown samples are presented, unsupervised learning methods can still obtain
rules that separate the samples according to the way they cluster‡.

A number  of  transformations  upon  the  pattern  space  may  be  made  to  simplify  the 
task  of  the  recognizer.  This  is  really  a continuation  of  the  basic  pattern  recognition 
problem,  but  many  researchers  have  chosen  to  separate  the  search  for  good  feature 
spaces  from  the  search  for  good  decision  rules. 


† A spherical Gaussian distribution is often used.

‡ Clustering is a concept that must be defined mathematically for such learning to take
place.
2.4 Estimating Distributions

Since the Bayes optimum rule assures us the "best" results attainable for a
separating  boundary  decision  rule,  we  would  like  to  be  able  to  apply  it.  Unfortunately,  we 
may  not  know  the  probability  densities  or  the  a priori  probabilities  of  the  classes. 
However,  if  there  is  some  evidence  from  the  nature  of  the  feature  measurements,  or  from 
the  underlying  pattern  process  itself,  we  may  be  able  to  estimate  the  a priori  probabilities 
and  to  make  assumptions  about  the  form  of  the  densities.  This  information  might  also  come 
from  statistical  analysis  of  the  samples  such  as  estimates  of  closeness  of  fit  to  well-known 
forms. 


The  mean  vector  and  covariance  matrix  fully  specify  a multivariate  normal  density 
function.  However,  to  compute  the  density  values,  the  covariance  matrix  must  be  inverted. 
The density function is:

$$f(x) = \frac{1}{(2\pi)^{m/2}\, |C|^{1/2}} \exp\left[-\tfrac{1}{2}\,(x - M)^{T} C^{-1} (x - M)\right]$$

where |C| is the determinant of C, C the covariance matrix (m x m), M the mean vector (m), and x
the sample vector (m).


The  classes  may  be  composite  clusters  of  a number  of  forms  or  they  may 
correspond  to  highly  complex  distributions  which  no  simple  form  can  suitably  estimate.  In 
fact,  we  may  not  fully  understand  the  underlying  physical  process  well  enough  to  derive 
the  form  at  all. 


An important approach available in such a case is that of Potential functions (or
Parzen estimators) [Ais64, Mei72]. The estimating density p is directly constructed by
superposition of a number of potential functions f as follows:

$$p(x) = \frac{1}{N} \sum_{i=1}^{N} f(x, y_i)$$
Thus, if those estimators are formed on each of the N sample points y, the "density"
value of a point x is a superposition of its relation to all the points in that class's sample
set. A typical form for f is the multivariate normal with covariance matrix a multiple of the
identity matrix (spherical shape, independent dimensions) and mean equal to y. The
multiple of identity used for the variance determines the sharpness of the peaks at each
point and, thus, the smoothness of the overall function.

Although the Gaussian is very well-behaved, a more computationally efficient
function given by Meisel is f(x,y) = h[d(x,y)], where d is a distance function (e.g. the city
block function, or the sum of the absolute differences in the m dimensions) and h is a
piecewise linear window function such as shown below (Figure 2.1).


In addition, if $\int f(x,y)\,dx = 1$ for all y, that is, if f is normalized, then p(x) is guaranteed
to be normalized as well. We can thus ensure that the space is not warped.

Four criteria for the function f are given:

1- f(x,y) should be maximum for x = y

2- f(x,y) should be approximately 0 for x distant from y

3- f(x,y) should be a smooth (continuous) function of
distance from x to y

4- if f(x1,y) = f(x2,y), x1 and x2 should be equally similar to y
in some sense
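
As an illustration, a potential function estimate of the form f(x,y) = h[d(x,y)] might be sketched as follows. The triangular window and its `width` parameter are assumptions chosen for simplicity (the window of Figure 2.1 is not reproduced here), and the estimate is left unnormalized, which is harmless when the same window is used for every class being compared.

```python
def city_block(x, y):
    """Sum of the absolute differences over the m dimensions."""
    return sum(abs(a - b) for a, b in zip(x, y))

def window(d, width):
    """Piecewise-linear window: 1 at d = 0, falling linearly to 0 at d >= width."""
    return max(0.0, 1.0 - d / width)

def potential_estimate(x, samples, width):
    """Average of the window values between x and each sample of one class."""
    return sum(window(city_block(x, y), width) for y in samples) / len(samples)
```

This window meets the four criteria: it peaks at x = y, vanishes beyond `width`, is continuous in the distance, and gives equal values to points at equal city block distance from y. A larger `width` gives a smoother estimate.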
One may choose to construct an indirect approximation to the density function. Certainly,
a conceptually simple approach is to divide the feature space into small volumes and
collect a histogram of the sample patterns. This can be disastrous if the dimensionality is
high, however. The number of small volumes in m dimensions is $r^m$ if each feature is
broken into r parts and there are m features.

Another problem of the histogram is that it depends very greatly upon the choice of
volume chunks for buckets. The sample points may be unimodally distributed, but if the
buckets are chosen to split the mode, and if not enough samples are available, spurious
modes may be observed. Another technique is to collect the Empirical Cumulative
Distribution Function. The N samples are ordered and plotted at increments of 1/N against
their values, for the single variate case. The resulting distribution approximation may be
smoothed and the slope measured for a density function. The advantage can be seen if
one notes that the ECDF depends in some sense on all the points that produce the
ordering rather than the points in a single bucket for the shape at any particular location.
It is thus a cumulative estimate rather than a local one. Techniques exist for estimating
the closeness of fit between standard distributions and an ECDF [Wil68]. Unfortunately,
extending the concept to the multivariate case is difficult and not usually done. However,
if there is reason to suspect that the features are independent, the ECDFs can be made
separately on each dimension.
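
For the single variate case, the ECDF and a crude slope-based density estimate can be sketched as below. The `span` parameter, which sets how many neighboring samples the slope is taken across, is a hypothetical stand-in for whatever smoothing one prefers.

```python
def ecdf(samples):
    """Order the N samples and pair each value with its cumulative fraction k/N."""
    ordered = sorted(samples)
    n = len(ordered)
    return [(v, (k + 1) / n) for k, v in enumerate(ordered)]

def density_from_ecdf(points, span):
    """Slope of the ECDF across `span` neighboring samples, reported at the midpoint."""
    slopes = []
    for k in range(len(points) - span):
        (v0, f0), (v1, f1) = points[k], points[k + span]
        if v1 > v0:  # skip ties, where the slope is undefined
            slopes.append(((v0 + v1) / 2, (f1 - f0) / (v1 - v0)))
    return slopes
```

A larger `span` smooths the estimate, much as a wider bucket does for a histogram, but no bucket edges have to be chosen in advance.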

2.5  Linear  Forms 

The simplest form for any decision boundary that partitions the feature space into
two separate parts is a linear form. Linear discriminant functions have a number of points
of appeal and have been extensively investigated.† Often good linear discriminants may
be determined as approximations or special cases of more general forms. The advantage
of simplicity in the decision rule is apparent to anyone who considers producing a
classification of a speech signal into one of 50 classes every 10 ms window, using a vector


† Nagy [Nag68] presents a theoretical comparison of a number of linear functions for a
two class problem.
of 128 FFT spectral values. Computationally, a linear decision rule can be effected with m
multiplies and a comparison, for m dimensions.† The N class problem may take from $\log_2 N$
to N-1 decision boundaries. Thus, at least 80,000 multiplies per second would be
required for the above classification scheme to be done in real time.
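
The figure can be checked directly, assuming non-overlapping 10 ms windows (100 windows per second) and m = 128 multiplies per boundary. In the best case of successive dichotomies, and in the worst case of N-1 boundaries:

$$\lceil \log_2 50 \rceil \times 128 \times 100 = 6 \times 128 \times 100 = 76{,}800 \approx 80{,}000 \ \text{multiplies per second},$$

$$(50 - 1) \times 128 \times 100 = 627{,}200 \ \text{multiplies per second}.$$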

An additional advantage is that linear rules are easily parametrized, and thus
learning can take place by means of the adjustment of a reasonably small number of
parameters. The effects of transformations of the feature space are more easily
understood in connection with simple rules. Finally, higher order forms for decision rules
can be reduced to linear form with addition of extra dimensionality.
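
The last point can be illustrated with a small sketch: a quadratic boundary in two features becomes a linear one once the squares and cross product are added as extra dimensions. The function names and the circular boundary are hypothetical examples.

```python
def lift(x):
    """Map (x1, x2) to the five-dimensional vector (x1, x2, x1^2, x2^2, x1*x2)."""
    x1, x2 = x
    return [x1, x2, x1 * x1, x2 * x2, x1 * x2]

def linear_rule(w, threshold, z):
    """Ordinary linear decision: dot product compared with a threshold."""
    return sum(wi * zi for wi, zi in zip(w, z)) >= threshold

# The circle x1^2 + x2^2 = 4 is a hyperplane in the lifted space.
CIRCLE_WEIGHTS = [0.0, 0.0, 1.0, 1.0, 0.0]
CIRCLE_THRESHOLD = 4.0
```

For instance, `linear_rule(CIRCLE_WEIGHTS, CIRCLE_THRESHOLD, lift([3, 0]))` is true (outside the circle) while the same call with `lift([1, 1])` is false. The price is the growth in dimensionality, and hence in multiplies per decision.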


2.6  Distances 


The first thing that comes to mind when considering the pattern space clusters for
classification is to somehow use a distance measure from the unknown to the clusters in
the decision rule. A number of distance measures have been defined for this purpose.
Although we are now faced with slightly more computing, since we must make a distance
calculation for each class as opposed to successive dichotomies of the space by
boundaries, this approach is more easily adapted to different sets of classes. We need not
depend upon the fortuitous placement of clusters where one decision can discard a large
set of classes. Furthermore, the usual properties of distance measures ensure that there
are no areas of the space where no class identity is assigned.


Euclidean distance is defined, in m dimensions, as:

$$d(x, y) = \left[\, \sum_{i=1}^{m} (x_i - y_i)^2 \right]^{1/2}$$


† If the hyperplane dividing the classes has the equation:

$$\sum_{i=1}^{m} w_i x_i - \lambda = 0$$

then it can be seen that all vectors x where $w \cdot x = \lambda$ lie on the plane. Hence the decision
rule is to form the dot product and compare with $\lambda$. The distance to the hyperplane is
also easily calculated as $|w \cdot x - \lambda| \,/\, \|w\|$.
The locus of points such that $d(x, m_1) = d(x, m_2)$ (i.e. the boundary between two classes
represented by $m_1$ and $m_2$) is:

$$\sum_{i=1}^{m} (m_{1i} - m_{2i})\, x_i \;-\; \frac{1}{2} \sum_{i=1}^{m} \left(m_{1i}^2 - m_{2i}^2\right) = 0$$

a linear equation in x. Geometrically, the boundary is a hyperplane which passes
perpendicular to the segment joining $m_1$ and $m_2$ and through its midpoint.


Correlation is defined as:

$$r(x, y) = \frac{x \cdot y}{\|x\|\, \|y\|}$$

or simply the cosine of the angle in m-space between the two vectors. The boundary
between two classes here passes through the origin and is at right angles to the plane
containing the origin and the two representatives.

More complex forms may yield a distance measure which is still linear. The
approximate maximum likelihood method assumes a single covariance matrix for all the
classes and multivariate normal distributions. In this case, it can be shown that the locus
of equal probability between two classes is a hyperplane cutting the segment connecting
the means at the midpoint but not necessarily perpendicular. The equation is:

$$w \cdot x - c = 0$$

where

$$w = A^{-1}(m_1 - m_2), \qquad c = \tfrac{1}{2}\,(m_1 + m_2)^{T} A^{-1} (m_1 - m_2)$$

with A the covariance matrix and $m_1$, $m_2$ the class means, and the distance measure is:

$$d(x, y) = (x - y)^{T} A^{-1} (x - y)$$

This distance, sometimes called the Mahalanobis distance [Dud73], may be used directly, but
it is a more computationally costly procedure than computing from the hyperplane equation.
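
The equivalence between comparing distances of this kind to two class means and testing one hyperplane can be checked numerically. The sketch below is restricted to two features so that the shared covariance matrix can be inverted by hand; equal a priori probabilities are assumed, and all names are hypothetical.

```python
def inverse_2x2(a):
    """Inverse of a 2 x 2 matrix given as [[p, q], [r, s]]."""
    (p, q), (r, s) = a
    det = p * s - q * r
    return [[s / det, -q / det], [-r / det, p / det]]

def mat_vec(a, v):
    return [sum(aij * vj for aij, vj in zip(row, v)) for row in a]

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def mahalanobis_sq(x, mean, cov_inv):
    """(x - mean)^T A^-1 (x - mean) for a shared covariance matrix A."""
    diff = [xi - mi for xi, mi in zip(x, mean)]
    return dot(diff, mat_vec(cov_inv, diff))

def hyperplane(m1, m2, cov_inv):
    """Return (w, c) such that w.x - c = 0 is the equal-distance boundary."""
    w = mat_vec(cov_inv, [a - b for a, b in zip(m1, m2)])
    c = 0.5 * dot([a + b for a, b in zip(m1, m2)], w)
    return w, c
```

A point x is closer to the first mean exactly when `dot(w, x) - c` is positive, so one dot product of m multiplies replaces two quadratic forms of roughly m squared multiplies each.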
2.7 Piecewise-linear Discriminants

Since linear discriminants are computationally inexpensive, we may wish to define
areas of the feature space in which different hyperplanes are used without fear of
excessive cost. In fact, arbitrarily complex boundaries can be approximated in this fashion.

The Perceptron model [Ros57] takes just such an approach. Perceptron networks
are oriented towards learning linear boundaries between two classes. When applied to the
N class situation, the boundaries are all applied in a pairwise fashion, and the classification
is made upon the advice of a number of classifiers. Figure 2.2 shows such a situation.
Note that the assignment is made when an input pattern gets at least two votes in this
case.


We may have a good idea of the implicit subgroups within each class. In this case,
we may define boundaries separating samples from each subclass from the rest of the
samples in other classes. Then define a piecewise boundary as $\max_j (w_j \cdot x)$, where the $w_j$ are
the coefficient vectors of the various subclass discriminants. It can be demonstrated that
this forms a convex, piecewise boundary around the composite of the subclasses.
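
A sketch of such a rule is given below. It assumes that each subclass discriminant includes a constant term (handled by appending 1 to the pattern vector) and that an input is assigned to the class with the largest score; the class names and weights in the usage note are hypothetical.

```python
def augmented(x):
    """Append a constant 1 so each weight vector can carry its own offset."""
    return list(x) + [1.0]

def class_score(x, subclass_weights):
    """max_j (w_j . x) over the subclass discriminants of one class."""
    z = augmented(x)
    return max(sum(w * v for w, v in zip(wj, z)) for wj in subclass_weights)

def classify(x, classes):
    """classes: dict label -> list of subclass weight vectors."""
    return max(classes, key=lambda name: class_score(x, classes[name]))
```

For example, with one feature, a class "vowel" having subclasses centered near -3 and 3 (`[[-3.0, -4.5], [3.0, -4.5]]`) and a class "other" centered at 0 (`[[0.0, 0.0]]`), inputs near either subclass center are labeled "vowel" and inputs near 0 are labeled "other", something no single linear boundary could do.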

It is clear that certain aspects of linear discriminants are desirable for speech
applications. These include most notably speed, ease of adaptability, and ease of
parametrization (size of the learning task). However, some constraints of the speech
problem, such as the need to deal with more than two classes, make many of these
methods unwieldy to apply or too computationally costly. The amount of computation
needed to evaluate the Mahalanobis distance directly is equivalent to that of using the
individual covariance matrices in a maximum likelihood rule. Thus, the need to handle many
clusters, of unknown placement in the feature space, leads to an algorithmic structure,
distance measures, which voids the usefulness of the approximate maximum likelihood
assumption. Without a better idea of the actual clustering of the classes, one cannot
suggest a generally applicable or even reasonable method for all speech classification a
priori. The conceptualization of the pattern space which most theoretical investigators
have brought to the problem may be invalid for speech. For instance, a great deal of
overlap may be the inevitable result of variations in speaker, performance, and phonetic
context. This is certainly borne out by the difficulty which even experienced phoneticians
have in identifying speech sounds in certain contexts or under certain conditions. The
best characteristic of linear rules is that they provide an extremely simple structure for
learning or tracking, and that they are computationally cheap. The ability to adapt the
classifiers under the aegis of higher-level feedback will be much more important to the
overall task than careful optimization of performance in a static situation.

2.8 Learning and Tracking of Clusters

The sets of samples from which decision rules are deduced and their parameters
extracted are called training sets; the process of acquiring the parameters in a statistically
proper manner is dealt with by Bayesian learning methods. In an important sense, the
pattern classifier is a learning process, although of a very simple kind where learning
may only occur at set times (training). A number of attempts have been made to employ
learning processes more directly (e.g. [Dru73, Nag66, Sel63, Uhr63]), to allow a classifier
to seek its own parameters or its own rules, to allow it to adapt them to slowly varying
pattern spaces, and to allow the identification of patterns for which no classification was
expected.
Many algorithms have been set forward which guarantee convergence to an optimal
linear classifier under various conditions. A famous one, for which an upper bound on the
number of steps for convergence was derived, is the perceptron error correcting
algorithm [Ros57]. The training set is presented in order as often as necessary and the
coefficients of the linear boundary separating the two classes, C1 and C2, change as
follows for the jth pattern presented, x(j):

$$w(j+1) = w(j) + x(j) \quad \text{if } x(j) \in C_1 \text{ and } w(j) \cdot x(j) \le \lambda$$
$$w(j+1) = w(j) - x(j) \quad \text{if } x(j) \in C_2 \text{ and } w(j) \cdot x(j) \ge \lambda$$
$$w(j+1) = w(j) \quad \text{otherwise}$$

where x is classified as C1 if $w \cdot x \ge \lambda$, etc. The weights are thus adjusted on
misclassification. A number of variations on this scheme deal with various amounts for
adjusting the w vector, varying the order of presentation, and the problem of using
imperfect components. Nagy [Nag68] discusses some of these results and also points out
that if the problem (i.e. the samples from the two classes) is not indeed linearly separable,
learning methods may not converge. Instead, they may oscillate or converge on a local
optimum. However, one way of gaining insight into whether two classes are linearly
separable is to try such a learning scheme on them.
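
The error correcting rule can be sketched as follows. The zero starting weights, the `max_passes` cutoff, and the returned convergence flag are assumptions added so that the loop terminates on data that are not linearly separable; they are not part of the algorithm as cited.

```python
def train_perceptron(samples, threshold, max_passes=100):
    """samples: list of (x, label) pairs with label 1 for C1 and 2 for C2.
    Returns (w, converged), where converged means a full pass made no change."""
    w = [0.0] * len(samples[0][0])
    for _ in range(max_passes):
        changed = False
        for x, label in samples:
            s = sum(wi * xi for wi, xi in zip(w, x))
            if label == 1 and s <= threshold:
                w = [wi + xi for wi, xi in zip(w, x)]  # pull w toward a missed C1 pattern
                changed = True
            elif label == 2 and s >= threshold:
                w = [wi - xi for wi, xi in zip(w, x)]  # push w away from a missed C2 pattern
                changed = True
        if not changed:
            return w, True
    return w, False
```

Running it with a modest `max_passes` and inspecting the flag is one practical form of the separability check mentioned above: a pass with no corrections shows the training samples are separable, while repeated corrections suggest, but do not prove, that they are not.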

Related to these methods are devices for tracking varying pattern clusters by
adjusting a linear discriminant to follow new patterns as they are presented. It has been
assumed so far that the training samples are drawn from the same distribution as the
unknown patterns will be. However, there may be overall changes in the pattern producer
that shift the clusters very slowly (relative to the frequency of incoming patterns) as one
proceeds with actual recognition. Consider the effects of excitement, fatigue, or
confidence upon a speaker dealing with a computer speech recognition system. Some
speaker normalization may be treated in this fashion as well. Although in the case of
changing speakers the clusters shift suddenly, such a change happens rarely in the time
scale we are considering, and much structure of the pattern space is common among
different speakers. Such adaptive behavior in pattern recognition dates back to work on
Morse code [GolB59, Sel63].
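
One simple way to picture such tracking, not necessarily the method of the cited work, is a decision-directed update: each incoming pattern is assigned to the nearest cluster mean, and that mean is then moved a small step toward the pattern. The sketch below assumes Euclidean distance and a fixed step size; the names `track` and `rate` are illustrative.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def track(means, patterns, rate=0.1):
    """Classify each pattern by nearest mean, then nudge that mean toward it.

    means: dict mapping class label -> list of floats (not modified).
    patterns: iterable of sequences of floats, in order of arrival.
    rate: assumed step size; small values follow slow drift and resist noise.
    Returns (updated_means, labels_assigned).
    """
    means = {label: list(m) for label, m in means.items()}
    labels = []
    for x in patterns:
        label = min(means, key=lambda c: squared_distance(x, means[c]))
        means[label] = [m + rate * (xi - m) for m, xi in zip(means[label], x)]
        labels.append(label)
    return means, labels
```

A small rate suits the slow drift described above. A sudden change of speaker would instead call for re-initializing the means, since a decision-directed update can lock onto the wrong cluster.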

While many of the methods discussed here are on shaky footing without an idea of
some of the aspects of the particular patterns in question -- how well they cluster, what
shape the clusters are, or how many there are -- there do seem to be areas of applicability
in speech recognition. In particular, it may be possible from phonetic studies to predict how
many clusters will be found and develop a good idea of the hierarchy of classes into which
they may be placed. However, we must take care not to impose upon the pattern space
relationships that we feel exist in some linguistic model of speech perception. The best
application for these methods is in discovering how well our ideas of the proper set of
classes coincide with the particular feature spaces, data, and rules we have chosen. In
addition, tracking techniques may well prove useful in going from one speaker to another
or in changing acoustic environments. The ability to train without tedious hand labeling of
speech data would be a great help, but the assumptions necessary for the unsupervised
learning algorithms thus far discovered do not seem to apply well to speech.


2.9  Conclusions  and  Discussion 

This survey has taken a narrower view of pattern classification than is sometimes
set forth by those who have viewed entire pattern recognition systems (e.g. [Min63,
New71, Uhr73]). The gross description of any pattern recognition problem is simply: map
a space of patterns onto a space of symbols (usually a finite set of names). The speech
recognition problem can be variously viewed as any of the following mappings:


    utterances    ->  semantic states
    phrases       ->  syntactic structures
    words         ->  lexical entries
    segments      ->  phones
    time slices   ->  acoustic-phonetic labels


This many-level view is taken in some currently operative systems, such as those of
Dixon [Dix75a] and Reddy [Red73]. Each level has its own data structures and sources of
knowledge from which rules of varying complexity can be deduced -- either for
recognition or interlevel translation. The same thing occurs in visual pattern recognition
with the various levels proposed for scene analysis and picture decomposition. Speech, in
particular, has large bodies of research relevant to these levels, although most of the rules
are available in generative form [New71]. Thus the question of where to draw the line
between "raw" pattern classification and various processes for reduction of search,
inference of goals, syntactic analysis, and even phonetic segmentation is a system
structure question. A good decomposition of the patterns at each level exists in the rules
that translate to the next lower level. The decomposition into phone-like labels for short
time windows is the most primitive one proposed. There is general agreement that the
burden of complex processing, feedback, context, etc. should be placed on the higher
levels, with the stream of labels serving as input to them. This sort of constraint will be
necessary in order to achieve real-time response in the future. Furthermore, a translation
to symbolic form must be made on the input signal for space and time economies in the
higher level processes as well.

The previous survey provides an overview of the methods available for
classification. However, some specific limitations must be made, justified by what is known
about speech, in order that the comparisons that are the object of this research may be
made in a reasonable time frame. The central aim of this discussion is to fix upon a useful
environment for comparing parametric representations. At the level of labeling the
acoustic signal, that means finding a "typical" algorithm or family of algorithms for
classification. The constraints placed upon this choice are time and space requirements,
the need for graceful error recovery and robustness under variations in the input quality,
and the necessity of experimenting with methods at the level of current technology.

Speech understanding systems have not generally employed extremely powerful
pattern classification schemes. The view has often been expressed that the problem must
be addressed at multiple levels by a variety of knowledge sources. The benefit of a
highly tuned and powerful classifier is nullified by the fact that speech has such inherent
variability that even different utterances by the same speaker will demonstrate widely
differing patterns for the same informational elements. The warning against overkill in
tuning to any set of training samples, no matter how extensive, has been made throughout
the pattern recognition literature, and speech is clearly subject to this problem.

We have chosen, therefore, a few well-accepted classifiers, simple and robust, that
cover much of the current speech understanding usage. The existence of more complex
methods at the state of the art should not invalidate these results, since the comparative
performance of parametric representations in the context of these straightforward
methods will serve as a guide to the design of simpler, soon realizable systems for limited
tasks, as well as a first-order prediction for more complex classifiers built upon the basic
classification algorithms. The latter will be true for the following reasons. First, the complex
methods for classifying sounds thus far proposed have been based, perhaps in a hierarchy
of decisions, upon simple concepts of "closeness" or "matching" in a pattern space or over
certain elements of evidence. Second, the evidence provided by human production and
recognition of speech seems to imply a continuum of sounds, arbitrarily interpreted as
belonging to information-bearing classes. A classifier and parametric representation which
adequately capture the structure of that continuum can be expected to perform well in
partitioning it. Thus, a parametric representation which allows some uniform distance
function to separate patterns according to their actual information class will have
indicated this similar structure, as well as an amenability to the use of concepts of
closeness in more complex methods.

The concept of distance is central to the use of a pattern space as a representation
of the actual phenomena. Thus the simplest method of classifying a pattern is according to
the distance from that pattern to the clusters of sample patterns. If the clusters are
parametrized by some Mj for samples belonging to class Cj, then the distance is some
function of the input, x, and Mj, and x is classified in Cj, where j minimizes the distance.
The property that the distance is a function of a single class's parameters means that no
region of the pattern space is unclassified, since every point must produce some minimal
value. If there are m classes, then m evaluations must be made and m sets of cluster
parameters must be stored. A hierarchy of decisions would provide a clear computational
improvement, but may be derived after the representation and distance function are
chosen, or may be deduced from the pairwise distances of the clusters themselves. In the
context of this research, the independent evaluations provide flexibility in the choice of a
set of classes and a way of comparing entire parametric spaces to one another upon a
simple base.
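
The minimum-distance rule just described can be sketched as follows. The sketch assumes each Mj is simply a mean vector and that the distance is Euclidean; the function names and the idea of returning the full ranking are illustrative choices, not details taken from the source.

```python
def euclidean(x, m):
    """Euclidean distance between pattern x and cluster parameters m."""
    return sum((xi - mi) ** 2 for xi, mi in zip(x, m)) ** 0.5


def classify_min_distance(x, templates, distance=euclidean):
    """Assign x to the class whose parameters give the smallest distance.

    templates: dict mapping class label Cj -> cluster parameters Mj.
    distance: any function of (x, Mj); swapped in to compare distance measures.
    Returns (best_label, ranking), where ranking is a list of
    (distance, label) pairs sorted from closest to farthest.
    """
    ranking = sorted((distance(x, m), label) for label, m in templates.items())
    return ranking[0][1], ranking
```

Because each class is evaluated independently, adding or removing a class only changes the `templates` table, which is the flexibility noted above. The cost is one distance evaluation per class.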

Finally, the value returned as a distance may be used to estimate the probability of
that class being the proper choice. Some distances estimate the distributions of patterns
for each class and provide this probability directly. Others may be compared to empirical
distributions of distances for that class. When such probabilities are available, meaningful
combinations of classifications can be made and the information available from classes that
are close to minimal distance can be employed.
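
As one illustration of the first case, suppose (purely as an assumption for this sketch) that every class is an isotropic Gaussian with the same variance and equal prior probability. Then the squared Euclidean distances to the class means determine the class probabilities directly. The function name and the `variance` parameter are hypothetical.

```python
import math


def posteriors(sq_distances, variance=1.0):
    """Turn squared distances into class probabilities.

    Assumes equal priors and isotropic Gaussian classes with a shared
    variance, so that P(class | x) is proportional to exp(-d^2 / (2 * variance)).
    sq_distances: dict mapping class label -> squared distance from x.
    """
    # Subtracting the smallest distance leaves the ratios unchanged
    # and avoids underflow when all distances are large.
    best = min(sq_distances.values())
    weights = {
        label: math.exp(-(d - best) / (2.0 * variance))
        for label, d in sq_distances.items()
    }
    total = sum(weights.values())
    return {label: w / total for label, w in weights.items()}
```

Two classes at nearly the same distance receive nearly equal probability, which is the information a higher level could use when the minimal-distance choice is doubtful.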

The choice of a limited kind of classifier will not hurt the usefulness of the results,
because most classifiers are based upon the distance concept. But more importantly, the
choice will help, because results will be easily applied to simple systems or parts of more
complex ones. The role of parametric representations can be more clearly seen in the
light of results that measure performance of simple algorithms. And the structural
similarity of the pattern space to the space of speech sounds is brought clearly into focus.

Chapter 3

Experimental  Considerations 

This chapter continues the discussion of pattern classification issues. Two very
important issues are: 1) the set of recognition targets -- the information-bearing classes,
as distinct from the templates for those targets -- which are the object of classification
activity; and 2) aspects of data quality and quantity. The methodology of training and the
experiments we have devised in this research are strongly affected by those
considerations.

3.1 Acoustic-Phonetic Classes

A very important issue is the number and nature of the classes that are the output
of any classifier. Sometimes this choice is trivial -- in fault recognition in machinery, for
example, there are two classes, faulty and fault-free. Often, however, even in binary
choices like this there are many sub-classifications -- unnecessary in all but a few special
cases (such as "fault-free but having a slight vibration that might lead to faults after
extended use," to continue the example). The situation in speech is just such a one.
Nasalized vowels, devoiced glides, transition portions, and deleted or altered consonants all
represent subclasses of things one may wish to deal with most of the time as entire
classes. Consequently, we feel some discussion of the directions available is warranted.
Our view of speech understanding systems is that non-local considerations, relating to
context, speaker idiosyncrasies, reduction of search, and knowledge about other levels
than acoustic-phonetics, will be dealt with at those other levels by other processes.
Hence, the kinds of recognition that some systems have done or propose to do -- tracking
slopes of formants, for example, to disambiguate consonants by the transitions into
following sonorants -- will require different objects of recognition than the process that
notices the appropriate context in which to invoke such specialized rules.


Data reduction, representation transduction, and hypothesis generation are the
principal roles for the processes we are investigating. These require general utility over
a broad range of speech sounds, robustness, and low cost, at least for some (initial)
parametric analysis routines.

There are two alternative approaches that have been taken in defining or
discovering classes for speech pattern classification, which may be called acoustic gestures
and features. The acoustic gesture approach takes the view that the phonetic significance
of a speech segment is to be extracted by other sources of knowledge. The classes into
which the segment is mapped have phonetic correlates, to be sure, but those correlates
are subject to context and speaker variations as well as the acoustic nature of the
segment. A set of gestures is chosen, therefore, which represents all the significantly
different sounds encountered in speech. Where the difference between two phonetic
classes is clearly reflected in a difference in their acoustic realizations, the task of
differentiating them is accomplished by differentiating the corresponding acoustic gestures.
Where the same sounds may realize different phonetic classes, however, an optimization of
sorts must be made, balancing usefulness of the information supplied by identifying the
acoustic gesture with the complexity of a growing number of specialized cases, each
identified as a separate gesture and many overlapping in the pattern space. The
boundaries, and thus the classes, most suitable to the problem divide the pattern space
into regions such that any pattern in a particular region could be a realization of the
phonetic situation that produced any other pattern in that region.

The feature approach is well described by Meisel: "The selection of a set of
features which efficiently describe a system in terms of a pattern to be recognized in
those features is itself a pattern recognition (problem). Each feature describes some
aspect of the pattern and amounts to a decomposition of the quality to be recognized in
the overall problem into a set of more easily recognized qualities." [Mei72] Since the
object is to disambiguate phonetic situations, the features must be those qualities that
distinguish different such situations. While the parametric representations may be viewed
as features, we would prefer to reserve the term for those qualities that more highly
correlate with phonetic information. For example, energy in the frequency band from 3 kHz
to 5 kHz may often imply the feature "Fricative," yet there are many non-fricatives, high
vowels for instance, which produce moderate amounts of energy in that
band. Obviously, a parametric representation which effectively allows for extraction of
these teleological features is a good one, but it has been the experience in speech
research that such a representation is very hard to find.

Under the influence of the simple model of a classifier used in this research, these
two points of view merge. More complex recognizers often mix the two views in
hierarchical schemes where the presence of certain features may trigger attempts to
classify among a sub-set of the target phonetic labels [Erm7^b]. If the gestures
correspond to a set of features, then the fact of recognizing a gesture provides evidence
in favor of its features' presence. Likewise, the set of features recognized forms an
address of the gesture. The duality of this relationship depends upon having the
corresponding weights available to calculate how much evidence is provided. (See figure
3.1 for the features and weights used in Hearsay II phonetic hypothesizing.)


[Table: one column per phonetic feature and one row per phone, with a leading row of
feature weights. Legible column labels include HI, FRNT, VEL, VCD, NUL, LOF, FRC, VOC, and
CON; legible row labels include B, P, F, Z, S, and R. The table entries are not recoverable
from the scanned source.]

Figure 3.1: Some Phonetic Features and Weights, HSII
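
The duality between gestures and features can be made concrete with a small weight table. The sketch below is illustrative only: the three gestures, the three features, and every weight are hypothetical values on a -5 to +5 scale chosen for the example. They are not the entries of figure 3.1, and the scoring rule (a weighted sum) is an assumption rather than the Hearsay II procedure.

```python
# Hypothetical gesture-by-feature weights; NOT the values of figure 3.1.
WEIGHTS = {
    "M": {"voiced": 5, "nasal": 5, "fricative": -5},
    "S": {"voiced": -5, "nasal": -5, "fricative": 5},
    "Z": {"voiced": 5, "nasal": -5, "fricative": 5},
}


def features_from_gesture(gesture, confidence=1.0):
    """Evidence for (positive) or against (negative) each feature,
    given that a gesture was recognized with the stated confidence."""
    return {f: confidence * w for f, w in WEIGHTS[gesture].items()}


def gesture_from_features(evidence):
    """Use a set of feature decisions as an 'address' for a gesture.

    evidence: dict mapping feature -> value in [-1, 1]
              (+1 clearly present, -1 clearly absent, 0 unknown).
    Returns (best_gesture, scores), scoring each gesture by the
    weighted agreement between its features and the evidence.
    """
    scores = {
        gesture: sum(w * evidence.get(f, 0.0) for f, w in fw.items())
        for gesture, fw in WEIGHTS.items()
    }
    return max(scores, key=scores.get), scores
```

Under these assumed weights, evidence of voicing and frication without nasality addresses "Z", and recognizing "M" yields positive evidence for the voiced and nasal features.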


In recognizing continuous speech, segments are usually labeled with some phonetic
information. As more phonetic information is available for a segment and its context, one
can state with more certainty the phonemic interpretation of that portion of the signal. An
interpretation is being made and the speech understanding system is making it. Thus, the
most important criterion for the source of knowledge which provides the phonetic
information is whether it is adequately serving the needs of the rest of the system. Those
needs are consistency -- certain phones in certain contexts should be interpreted in the
same way -- and robustness -- unusual contexts or poor conditions should bring about
interpretations which are wrong in proportion to the degree of degradation encountered.

In conflict with recognizing a single phonetic situation is the problem that features
interact in complex ways under various conditions. The feature "high third formant" may
mean entirely different things depending upon the location of the second formant, or may
be irrelevant. These interactions could be expressed as rules and the appropriate
recognitions could be the output of a complex interpretive stage. The question is whether
anything is gained. The key to a suitable feature decomposition of the pattern recognition
problem is the discovery of a set of features which are easier to recognize, or recognized
more accurately (or both). While many methods of speech analysis have attempted to
provide parameters that make such recognitions easy, none has been able to carry the
entire burden. Some features are easily extracted from one parametric representation and
others are just confused, yet all are necessary if critical errors are to be avoided.

If one attempts to interpret the speech signal by the results of acoustic gesture
recognition, one finds that different phones often give rise to the same acoustic gesture,
and, conversely, that the same phone can be influenced by phonetic context and conditions
of emotion, speaker accent, prosodies, and so on, to produce different acoustic gestures.
In the same way that features might be disambiguated by a system of rules, so could a
string of acoustic gestures be corrected. The context-dependent errors will appear in
sequence with labels that indicate the context. Alternative labels of acoustic gestures
would give an indication of likely candidates for error correction. The sequence /g I e - s/
for example, could be altered to read /g e t/ by application of some simple and obvious
rules from the phonetic source of knowledge. In fact, features could be extracted from
the sequence of acoustic gestures. The recognition of /m/ means "voicing", "low second
formant", and "nasal" features are present to some extent. Again the major issue is
whether the recognition process is easier or more accurate.

Taking the entire label as a unit will tend to help avoid errors resulting from
interactions with one or two wrong features. The classifier can rank the labels, and thus
provide graceful error correction. On the other hand, in order to represent a wide range
of acoustic situations encountered in continuous speech, and to provide sufficient
disambiguation of similar phones under most conditions, the set of acoustic gestures must
be fairly large -- multiple labels for each segment may be required. The trade-offs
depend upon system organization to some extent. The amount of higher level support is a
significant factor in this choice. However, the essential interchangeability of the two views
should be apparent. For the purposes of this research, it does not matter whether the
parametric representations are compared for their feature recognition support or their
acoustic gesture recognition. Rather, their ability to allow the classifier to come up with
the label or feature set best corresponding to the phonetic situation can be measured
over the entire test data by either method of recognition.

The labeling results that will be reported are accuracies over a set of acoustic
gestures with phonetic interpretations. The selection of these is important, as is the
training of the classifiers used.

In chapter 7, we will discuss an algorithm for discovering the natural clustering of a
set of sample patterns, and for discovering representatives -- templates -- for each
cluster from the samples themselves. This algorithm is used to refine a set of phonetic
labels -- targets -- which have been marked over a training corpus of speech parameters
by hand segmentation and labeling. There are a great many other methods for refining
such a set of classes. Iterative adjustment techniques are discussed in the pattern
classification literature [Nag68]. In addition, learning techniques might be used to adjust
the set as well as to "track" slow shifts in the nature of the data in an operating speech
understanding system. A careful survey and analysis of the relative merits of these
methods seems to be rather far removed from the central issues of this dissertation. We
are primarily concerned with acquisition of a reasonable set of classes for testing labeling
proficiency in a benchmark procedure. The clustering method implemented has given
evidence of quite adequate performance, as well as of being consistent with the viewpoint
of acoustics-oriented parameter independence.

3.2  Data  Quality  and  Quantity 

The discussions above have been concerned with empirical methods for pattern
classification for speech understanding systems. Consequently the data upon which the
methods are based are an important factor in the validity of the results. While the form of a
decision rule may be chosen by intuition, necessity, or fiat, the data which provide training
statistics (which most rules use) are never perfect. The data which provide the testing
results must be subject to similar scrutiny. Are the corpora representative of speech (and
what does that mean)? Are they large enough? What should be the relationship between
training and testing data?

3.2.1  Speech  Data 

The quality of data one acquires in a body of speech depends upon who is speaking,
how he is speaking, and how the recording is made. Many applications and a number of
existing systems deal with isolated words, and thus avoid the variability introduced in
continuous speech by coarticulation, varying stress, and the difficulty of isolating the
proper segments to form a lexeme. Variation in speaker falls into three types: different
speakers have different gross vocal characteristics, the most notable being fundamental
pitch frequency; individual speakers within one gross type use different portions of the
acoustic domain for particular phonetic items (varying "dialect"); and speakers vary in the
coarticulation rules that may explain their performance. Effects of ambient noise,
microphone characteristics, and such must be dealt with as well at some level.

Speech recognition research has been primarily concerned with intra-speaker
variations. This is not to say that methods for normalizing inter-speaker variations are not
a significant part of the problem, but it is generally felt that a model capable of dealing
well with one speaker will be extendible to the multi-speaker problem. This view arises
from the observation that nearly every sound within the usual domain of speech may be
expected to be produced by a speaker in some context. Furthermore, we understand total
strangers with very little training effort, although those with unusual accents cause more
difficulty. Certainly, the acoustic recognition process in human (or machine) perception of
speech is little affected by changes of a phonological nature. These are handled (or
should be) by higher level transformations. Changes in the interpretation of
acoustic gestures from one speaker to another, by contrast, while seemingly a more
fundamental normalization problem, appear to be the kind of problem more amenable to
solution by some fairly straightforward iterative adjustment or learning techniques. There
must be a common structure to the patterns we process that is very pervasive. The ability
of a parametric representation to measure just that structure will become apparent by
experiments with no consideration given to speaker normalization. It can be observed that
almost as much variation exists within one speaker's performance as between two similar
speakers'.

A final consideration is that of recording quality and related issues. Some difficulty
is introduced when microphone characteristics, for example, impose varying spectral
characteristics, or when noise influences the values of some parameters. However, noise
subtraction and spectral leveling methods can be employed, and good quality equipment
and quiet recording conditions are reasonable for these experiments.

The following decisions seem reasonable for training and testing data: Continuous
speech by adult male speakers will be used for testing. The sentences will be drawn from
a variety of task domains and therefore represent various words and sentences for
varying phonetic contexts. All recordings will be on reasonably high quality equipment
with low ambient noise, but no excessive concern with this seems warranted.

3.2.2 Quantities

It is very difficult to acquire large numbers of sample speech patterns which are
properly labeled. This problem is common to other pattern classification applications,
although for a variety of reasons. In speech, the difficulty is in properly segmenting and
labeling by hand the training and test corpora. Although refinement methods will help in
training with poorly labeled data, the best possible hand segmentation and labeling should
be used. Testing can produce valid estimates of accuracy only if the test data is also
properly marked, and the validity of the estimates will increase with the quantity of data.
There is a strong tendency, therefore, to try for as much data as possible, and to make
what data is available serve both purposes.

One can over-design or over-train a rule to the point of degrading performance on
data other than the training set. This phenomenon depends upon the type of classifier
rule, but also upon the fact that the data may not be truly representative. As more data is
used, this latter becomes less likely. The size of the training set, then, should relate to the
complexity of the classifier. More complex classifiers will be better able to separate the
training sample clusters, and thus take on their particular structure, so one would want
that structure to be more representative of speech in the pattern space. For example, a
nearest-neighbor rule, trained on a few samples, may allow a few spurious samples to
capture large areas of what should be another class because no representatives of that
other class were in the training data. A simpler Euclidean distance classifier using the
sample means would be less affected by the few bad samples.
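
The contrast between the two rules can be illustrated on a toy training set. Everything in the sketch below is invented for the example: the two classes "A" and "B", their coordinates, and the single spurious sample labeled "A" that lies in what is otherwise B's territory.

```python
def sq_dist(a, b):
    """Squared Euclidean distance."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def nearest_neighbor(x, training):
    """Label of the single closest training sample."""
    return min(training, key=lambda s: sq_dist(x, s[0]))[1]


def class_means(training):
    """Mean vector of each class in the training set."""
    sums, counts = {}, {}
    for x, label in training:
        acc = sums.setdefault(label, [0.0] * len(x))
        for i, xi in enumerate(x):
            acc[i] += xi
        counts[label] = counts.get(label, 0) + 1
    return {c: [v / counts[c] for v in s] for c, s in sums.items()}


def nearest_mean(x, means):
    """Label of the closest class mean."""
    return min(means, key=lambda c: sq_dist(x, means[c]))


# Invented data: class A near the origin, class B near (6, 6),
# plus one mislabeled "A" sample at (5, 5).
TRAINING = [
    ((0.0, 0.0), "A"), ((0.2, 0.1), "A"), ((-0.1, 0.1), "A"), ((0.1, -0.2), "A"),
    ((5.0, 5.0), "A"),  # spurious sample
    ((6.0, 6.0), "B"), ((6.5, 6.0), "B"), ((6.0, 6.5), "B"), ((7.0, 7.0), "B"),
]
```

For a test pattern such as (5.2, 5.1), the nearest-neighbor rule follows the spurious sample and answers "A", while the mean-based rule answers "B", because one bad sample moves the mean of A only slightly.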

The number of dimensions of the pattern space also plays a role. Data points in
high dimensional spaces will be remarkably far apart and thinly spread. If one wishes to
estimate a distribution in such a space, one needs a large number of points to fill in a
histogram, fewer to produce a valid covariance matrix, and fewer still to estimate the
mean. The simple classification rules may be as good when trained on a few samples as on
many for these reasons.
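
A rough way to see the ordering is to count the free quantities each estimate must fix in a space of d dimensions. The count is only a proxy for the number of samples needed, and the choice of 10 histogram bins per axis is an arbitrary assumption for illustration.

```python
def parameters_to_estimate(dims, bins_per_axis=10):
    """Number of independent quantities fixed by each kind of estimate.

    mean:       one value per dimension
    covariance: a symmetric dims x dims matrix, dims * (dims + 1) / 2 values
    histogram:  one cell count per cell, bins_per_axis ** dims cells
    """
    return {
        "mean": dims,
        "covariance": dims * (dims + 1) // 2,
        "histogram": bins_per_axis ** dims,
    }
```

For a 12-dimensional pattern this gives 12 values for the mean, 78 for the covariance matrix, and 10^12 histogram cells, so the histogram is out of reach for any corpus of hand-labeled speech.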

Meisel [Mei72] points out that the ratio of the number of samples, N, to the number
of features, m, should be significantly larger than unity. His experimental results indicated
that an N/m of 3 to 5 usually yields successful training. However, consider the case where 100
two-valued patterns are available. If someone maliciously adds 98 values to each pattern,
all drawn from the same distribution, the results will still be as good as in the case where
N/m was 50. Thus the "true" dimension of the patterns must be considered. Meisel also
gives as an example an experiment where 10 two-valued samples are drawn from the same
distribution for each of two supposedly different classes. A linear transformation could be
found which allowed a perfect classification with a linear boundary. Yet, when 90 more
samples were produced, no such transformation could be found. The two clusters had
merged perfectly.

The  expected  performance  of  rules  in  practice  can  only  be  estimated  by  examining 
their  performance  on  separate  test  sets.  (If  distributions  are  available  for  the  expected 
inputs,  theoretical  bounds  can  be  derived  for  the  expected  error  rates.)  Either  the 
available  samples  must  be  divided  into  training  and  test  sets  for  this  purpose  or  else  an 
iterative process as follows can be used.

Divide the samples into k sets: S_1 ... S_k

Train on all but S_i, and test on S_i, for each i.

It can be argued that if k is the number of samples, this is equivalent to testing on the
training  set.  Yet  there  are  some  methods  — Potential  functions  or  Nearest  Neighbor  — 
which  are  guaranteed  to  perform  perfectly  on  the  training  set  that  will  not  necessarily  do 
so  in  this  case. 
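
The iterative procedure above can be sketched as follows. This is a minimal illustration under stated assumptions: the sample data, the function names, and the choice of a nearest-neighbor rule are hypothetical, not the procedure actually run in this research. With k equal to the number of samples, each sample is tested against a rule trained on all the others.

```python
import math

def nearest_neighbor(train, x):
    """1-NN rule: label of the closest training sample."""
    return min(train, key=lambda s: math.dist(s[0], x))[1]

def k_fold_error(samples, k, classify):
    """Train on all but S_i and test on S_i, for each i; return the error rate."""
    folds = [samples[i::k] for i in range(k)]  # the k sets S_1 ... S_k
    errors = 0
    for i, held_out in enumerate(folds):
        training = [s for j, fold in enumerate(folds) if j != i for s in fold]
        errors += sum(1 for vec, label in held_out
                      if classify(training, vec) != label)
    return errors / len(samples)

# Hypothetical two-class data with one atypical "A" sample.
samples = [
    ((0.0, 0.0), "A"), ((0.2, 0.1), "A"), ((-0.2, -0.1), "A"),
    ((4.5, 0.5), "A"),
    ((4.0, 0.0), "B"), ((3.8, 0.2), "B"), ((4.2, -0.2), "B"),
]

# Testing on the training set itself: every sample is its own nearest neighbor.
resubstitution = sum(1 for vec, label in samples
                     if nearest_neighbor(samples, vec) != label) / len(samples)

# k equal to the number of samples: each sample is held out in turn.
leave_one_out = k_fold_error(samples, len(samples), nearest_neighbor)
print(resubstitution, leave_one_out)
```

Here the nearest-neighbor rule is perfect on its own training set, yet the held-out estimate is not, because the atypical sample is misclassified once it is removed from training.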

The training sets to be used in this research consist of enough data to provide
between 30 and 150 sample segments for each of about 40 common phones of English.
While this is not really enough to provide good estimates of all the major allophonic
variants of each phone, the difficulties of acquiring the extremely fine hand segmentation
and labeling needed to avoid introducing errors in training have precluded using more
data. While this gives a rather small N/m value for m = 128 (in the SPG case), it must be
noted that the SPG parameters are highly correlated with one another and are, in fact,
derived from 15 LPC parameters. No doubt the "true" dimensionality of the space is
considerably lower.

The  test  data  is  representative  of  the  kind  of  speech  to  be  encountered  by  actual 
speech understanding systems. For the near future, that seems to be cooperative
speakers,  high  quality  low  noise  conditions,  somewhat  limited  vocabularies,  and  continuous 
speech.  To  that  end,  we  have  chosen  to  train  with  a separate  set  of  27  sentences  spoken 
by  each  speaker.  These  contain  approximately  1200  phonetic  segments,  and  are  designed 
to contain a number of instances of the most commonly occurring allophones of English.
[Sho74b] 

3.3  Summary 

We  have  discussed  two  issues  of  considerable  importance  to  recognition  of  speech: 
recognition targets and data quality and quantity. We have tried to make choices in these
areas which are reasonable, given the limited resources available. Considerably more data
is being dealt with in this research than has been the case in past efforts. Training and
testing sets are of the order of 1000 segments and are separate data. Recording is still
over high quality microphones. The choice of template recognition instead of feature
recognition is a matter of available methods and personal preference.

Chapter  4

Speech  Recognition  Systems 

It is impossible to study the parametric level without paying some attention to the
total systems. In recent years, there have been a number of implementations of speech
recognition systems with a variety of knowledge sources, control mechanisms, and data
structures.  Some  of  these  systems  have  understanding  of  the  content  of  the  utterance  as 
part  of  their  power.  Others  may  be  called  word  recognition  systems  and  are  oriented 
towards  isolated  word  recognition  by  uniform  strategies,  or  with  limited  knowledge 
sources.  All  these  systems  have  a component  which  analyzes  the  parametric  input. 
Almost all produce a phonetic-like transcription of the utterance as some internal
intermediate representation. This chapter is a brief survey of the more salient aspects
and the available performance measurements of a number of systems in current
development.

4.1  The  Parametric  Level 

The  previous  discussions  have  dealt  with  the  individual  aspects  of  acoustic-phonetic 
processing  in  speech  understanding  systems:  the  parametric  representation,  the  roles  of 
segmentation  and  of  labeling,  pattern  classification  techniques,  costs,  and  quality  and 
quantity  of  data.  In  one  sense,  this  covers  most  of  the  background  material  for  the 
particular  experimental  results  to  be  presented.  However,  it  is  often  important  to  view 
such  results  with  the  perspective  of  other  research  in  the  area.  Indeed,  a large  part  of 
the effort spent in this work was spent in developing methods which could perform
reasonably well by current standards and yet which would not be specific to any
particular parametric representation. This chapter is a brief survey of the parametric
level analysis performed in a number of current speech understanding systems or partial
systems. However, it is almost impossible to meaningfully compare their performance
results in a quantitative manner since they use different representations for output of the
acoustic-phonetic information, since the published results involve widely varying test
conditions and methods of evaluation, and since the design goals of the various programs
differ considerably depending upon the structure and goals of the complete systems of
which they are sub-parts. This survey merely seeks to provide a framework for better
understanding of the role of acoustic parameters, their use, and the kind of performance
currently  available.  In  addition  to  the  mechanical  speech  recognition  work,  we  will  cite 
some  human  performance  results  [Sho75a]  in  order  to  better  place  machine  performance 
in  perspective. 

Since  we  have  had  only  limited  contact  with  many  of  the  systems  currently  being 
developed,  we  have  had  to  rely  upon  published  descriptions  of  their  methods, 
performance,  and  overall  structure.  In  this,  we  were  greatly  aided  by  an  in-depth  survey 
of four acoustic-phonetic levels of large systems by Hieronymus [Hie75]. However, some
assertions and methods of evaluation which have been encountered in the speech
recognition literature seem to be biased by pre-conceived notions, or to reflect poor
techniques. For example, Hieronymus, in summarizing vowel recognition for the system at
Lincoln Labs, points out that 50% vowel identification accuracy for first-choice is very poor
since humans find this task so easy. Yet Shockey and Reddy discovered that human vowel
perception -- using auditory input -- of continuous, but unfamiliar, speech was not
significantly better, perhaps 55% to 60%. This mistaken belief that humans do very well
because  they  have  some  spectacularly  successful  auditory  mechanism  has  doubtless  led  to 
a great  deal  of  misdirected  effort. 

A number  of  published  results  appear  to  have  been  based  upon  testing  with  the 
same  data  corpus  used  for  training  (if  training  is  done)  or  gathering  of  statistics  to 
describe recognition targets. Even "tuning" or deriving rules from the same data used for
testing  can  bias  results.  It  is  an  understandable  mistake  to  make  since  much  speech  data 
seems  to  us  to  be  of  equivalent  difficulty  and  quality.  Yet  this  is  a bad  practice.  We  have 
observed  considerable  degrading  of  performance  results  when  separate  test  data  is  used, 
indicating that the results over the training data are artificially high. This is a point that is
strongly taken in the more traditional pattern classification literature. The manner and
extent of development of classifiers can severely bias results.

A final point on the difficulty of directly comparing reported results is that the
application  of  phonological  knowledge  greatly  improves  some  of  the  raw  acoustic 
recognition  results  before  they  are  measured.  This  is  unavoidable  since  some  systems 
have  phonological  mechanisms  built  in  at  the  lowest  level,  while  others  apply  this 
knowledge  at  other  levels  --  either  explicitly  or  implicitly  (e.g.,  integrated  into  the  lexical 
entries) -- or not at all. One experiment with the Dragon system has shown us just how
much  action  can  result,  in  continuous  speech,  from  this  knowledge.  When  a set  of 
templates  for  phone  recognition,  acquired  by  the  clustering  algorithm  to  be  described  in 
chapter  7,  was  used,  word  accuracy  of  the  system  dropped  dramatically  from  previous 
results.  Yet  when  new  phonological  descriptions  (all  drawn  from  inspection  of  the  training 
data  only)  were  integrated  into  the  word  lexicon,  performance  was  improved  to  better 
than 95% over a separate test set of sentences.†

There  are,  therefore,  a variety  of  reasons  for  the  relative  lack  of  comparative 
studies  at  this  (or  other)  levels  of  speech  understanding  systems.  The  least  that  a look  at 
the  current  efforts  can  show  is  the  essentially  central  nature  of  the  parametric 
representation,  the  important  role  often  played  by  pattern  classification  techniques,  and  a 
fairly  broad  consensus  on  the  role  of  acoustic-phonetic  analysis  in  the  total  understanding 
task. 


4.2  Large  Systems 

The  following  descriptions  are,  of  necessity,  very  brief.  We  cannot  hope  to  do 
justice  to  the  great  deal  of  effort  and  knowledge  that  has  gone  into  these  programs.  We 
are  merely  trying  to  gain  a general  understanding  of  the  nature  and  quality  of  recognition 
at  the  lowest  level  of  a number  of  current  speech  understanding  systems.  In  a number  of 


† Clearly the prior lexical entries contained valid phonological descriptions of the words in
terms  of  the  old  templates,  but  when  a new  set  of  templates  was  introduced,  a new 
phonetics  was  introduced,  hence  the  need  for  a new  phonology. 
cases, labeling accuracies are cited for classes of sounds such as vowels or fricatives. We
do  not  know  how  many  phones  are  contained  in  each  of  these  classes,  so  we  cannot 
recommend  detailed  cross-comparisons  of  these  results. 

4.2.1  BBN  Speechlis

The Bolt Beranek and Newman Speechlis system [Woo74, Woo75, SchR75] acoustic-phonetic
recognition procedures are based upon a lattice data representation used
throughout that system, which allows alternative segmentation and labeling decisions to be
maintained. They have chosen a number of fairly feature-specific parameters (i.e.,
designed to capture specific phonetic features) which are input to a set of heuristic
decision procedures. The stated philosophy is to deal with the inherent ambiguities in the
speech signal by allowing ambiguity in the recognition process. First, segment boundaries
are located by looking for clues in any one of a set of special segmentation parameters,
then labeling is performed on a different set†, averaged over the central half of each
segment.  This  produces  a broad  class  label  from  [Sonorant,  Obstruent,  Fricative,  Nasal, 
Plosive].  Finally,  class-specific  decision  procedures  are  applied  to  identify  each  segment 
as  one  of  a set  of  36  phones. 

Hieronymus reports segmentation accuracy of about 5% missing and 6% extra
boundaries on a small set of data. Labeling accuracy is about 91% correct vowel
identification within three choices.‡ Formant tracking and speaker normalization functions
are employed to benefit here. It is not known whether this accuracy figure includes cases
where the vowel identification routine was not invoked because of an incorrect class label.
This  does  appear  the  most  sophisticated  aspect  of  their  labeling. 
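
The surveyed reports do not say exactly how "missing" and "extra" boundaries were counted. One plausible convention, sketched below purely as an assumption, pairs each hand-marked boundary with at most one machine boundary lying within a fixed time tolerance, and expresses both error counts relative to the number of hand-marked boundaries. The function name, the tolerance, and the boundary times (in ms) are hypothetical.

```python
def score_boundaries(reference, detected, tolerance):
    """Return (missing_rate, extra_rate), both relative to the reference count.

    A reference boundary is 'found' if an unused detected boundary lies
    within `tolerance`; the closest such candidate is taken (greedy matching).
    """
    ref = sorted(reference)
    det = sorted(detected)
    used = [False] * len(det)
    matched = 0
    for r in ref:
        best = None
        for k, d in enumerate(det):
            if used[k] or abs(d - r) > tolerance:
                continue
            if best is None or abs(d - r) < abs(det[best] - r):
                best = k
        if best is not None:
            used[best] = True
            matched += 1
    missing = len(ref) - matched  # hand-marked boundaries never detected
    extra = len(det) - matched    # detected boundaries with no hand-marked partner
    return missing / len(ref), extra / len(ref)

# Hypothetical boundary times in milliseconds.
hand_marked = [100, 250, 400, 620]
machine = [105, 260, 330, 410, 500]
missing_rate, extra_rate = score_boundaries(hand_marked, machine, tolerance=20)
print(missing_rate, extra_rate)
```

Because the tolerance and the denominator are free choices, figures computed under different conventions are not directly comparable, which is one more reason for caution in cross-system comparison.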

In general, the structure of the program seems interesting, the invocation of special
tests,  formant  normalization,  and  the  alternative  segment  structure  being  often  cited  as 


† LPC parameters such as formant frequencies seem to play an important role in both
processes.

‡ Vowel identification (presumably after the vowel segment has been located and classed
as such) is often cited as an important statistic. It sometimes is the only labeling process
that can be separated from segmentation or phonological analysis.
powerful techniques to be applied at this level. However, we suspect that these
preliminary results should be treated carefully until more detailed performance analysis is
forthcoming.

4.2.2  CMU  Hearsay  I and  II 

There  are  three  systems  being  developed  at  Carnegie-Mellon  University:  Hearsay  I, 
which  is  not  being  carried  any  further,  Hearsay  II,  the  research  system  for  a great  deal  of 
the  current  effort,  and  Dragon  and  related  systems,  developed  by  Baker  [BakJK75b]  and 
extended  by  Lowerre  [Low76].  We  will  discuss  Dragon  later  in  this  chapter. 

Hearsay I [Red73] was developed to test the Hypothesize and Verify paradigm as
well as to provide a system which gives relatively great importance to higher level
knowledge sources. The acoustic parameters are six derived parameters based upon the
amplitude and zero-crossing measures from octave bandpass analog filters (ZCC mentioned
earlier). A pseudo-phone (PP) label is placed on the signal every 10 ms. The stream of
PPs is smoothed and their broad class memberships† yield a first segmentation. Then
some correction and further segment identification is made using additional parameters
designed to measure overall energy and locate points of maximal energy such as vowels.
The same classification function is used to verify phones hypothesized by higher level
knowledge sources as is used to label the segments. Labeling is based upon the Euclidean
distance from the input 10 ms. sample parameters to a set of speaker-specific templates
for the PPs. These are trained on a list of neutral-context words, one training segment
per label.
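
The label-smooth-segment sequence just described can be sketched roughly as follows. This is not the Hearsay I code: the two-parameter frames (the real system uses six), the template values, the label names, and the three-frame majority smoothing are all assumptions made for illustration. Only the broad classes Silence, Fricative, and Voiced and the use of Euclidean distance to templates come from the description above.

```python
import math
from collections import Counter

def label_frames(frames, templates):
    """Assign each 10 ms frame the label of its nearest template (Euclidean)."""
    return [min(templates, key=lambda lab: math.dist(f, templates[lab]))
            for f in frames]

def smooth(labels, half_width=1):
    """Majority vote in a sliding window; the centre label wins ties."""
    smoothed = []
    for i, centre in enumerate(labels):
        window = labels[max(0, i - half_width): i + half_width + 1]
        counts = Counter(window)
        best, best_count = counts.most_common(1)[0]
        smoothed.append(centre if counts[centre] == best_count else best)
    return smoothed

def segment(labels, broad_class):
    """Cut the label stream wherever the broad class changes.

    Returns (start_frame, end_frame_exclusive, broad_class) triples.
    """
    segments, start = [], 0
    for i in range(1, len(labels) + 1):
        if i == len(labels) or broad_class[labels[i]] != broad_class[labels[start]]:
            segments.append((start, i, broad_class[labels[start]]))
            start = i
    return segments

# Hypothetical speaker-specific templates and broad class table.
templates = {"sil": (0.0, 0.0), "s": (1.0, 5.0), "a": (5.0, 1.0), "i": (4.0, 3.0)}
broad_class = {"sil": "Silence", "s": "Fricative", "a": "Voiced", "i": "Voiced"}

frames = [(0.1, 0.0), (0.0, 0.2), (1.0, 4.8), (4.9, 1.1), (1.1, 5.1),
          (0.9, 5.0), (5.0, 1.0), (5.1, 0.9), (4.0, 2.9), (4.1, 3.0)]

raw = label_frames(frames, templates)
segments = segment(smooth(raw), broad_class)
print(raw)
print(segments)
```

In this toy stream the isolated "a" label inside the fricative is removed by smoothing, and the two vowel labels fall into a single Voiced segment.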

The acoustic-phonetic processing in Hearsay I is fairly crude, and the good
performance of the system is, to a large extent, due to correct application of syntax and
semantics.

The  procedures  developed  for  use  in  this  thesis  research  are  being  employed  as  the 
parametric  level  for  Hearsay  II  [Erm75].  Some  of  the  parametric  representations  which  we 


† Silence, Fricative, Voiced
test here are currently being considered. This is not a critical design decision, however,
because of the flexible nature of these routines. The set of phone-like labels for the
acoustic-phonetic transcription output of this level is re-processed at a higher level. A
set of ternary phonetic features is assigned to each label and weighted according to their
relative importance. A simple algebra of features and weights is then available to combine
alternative labels with scores, and to convert back to phonetic labels if so desired. This
information, along with stress contour analysis to locate syllabic nuclei, location of stop
consonant patterns in the sequence of acoustic segments, and recombination of similar
features,  produces  hypotheses  of  segments  which  may  overlap  in  time  or  have  multiple 
alternative  labels  (much  as  in  the  BBN  lattice). 

A major  concern  is  to  avoid  disastrous  errors  by  paying  the  higher  cost  of  keeping 
many  hypotheses  around.  Thus,  the  segmentation  is  tuned  to  miss  as  few  boundaries  as 
possible (approximately 2%). The use of phonetic features at the next level, where
recombining of segments as well as hypothesizing and verifying of labels is done, allows
the  partial  match  of  correct  phones  where  a less  conservative  system  might  reject  them. 
A final  note  about  Hearsay  II  is  that  it  is  probably  the  most  flexible  overall  system 
organization  being  developed  for  speech  understanding.  There  is  nothing  in  the  global 
data representations nor in the control structures to preclude applying knowledge at any
time  to  various  parts  of  the  data.  Such  decisions  are  made  by  the  knowledge  sources 
themselves  in  an  asynchronous  fashion. 

4.2.3  SDC  VDM  System 

The VDM System of System Development Corporation is oriented toward verifying,
at  the  acoustic-phonetic  level,  a string  of  phones  hypothesized  by  other  levels.  [Rit74] 
Lexical  entries  are  used  to  generate  phonemic,  then  phonetic,  and  finally  parametric 
representations  of  hypothesized  syllables  through  the  application  of  lexical  lookup, 
phonological  rules,  and  parametric  mapping  procedures.  Then  a syllable  mapping  process 
attempts  to  match  the  observed  parameters  with  the  hypothesized  ones.  (Actually,  word 
beginnings  are  found  in  a bottom-up  segmentation.)  A number  of  coarticulation  rules  are 
implemented  in  a very  flexible  structure. 

Vowel recognition is reported at lit for three choices (A87. for first choice) with an
additional 11% very nearly the correct vowel. Segmentation of vowels, extremely
important in a syllable-oriented system, was 91% correctly found. Fricatives (except /th/)
were generally in the 80% recognition range.

As  an  overall  view,  there  seems  to  be  a great  deal  of  phonological  and  coarticulative 
knowledge  being  applied  by  this  system  as  an  integrated  part  of  the  parametric 
recognition processes. This makes it difficult to use the reported results in direct
comparison with other parametric level recognizers. Some experimental results for
individual pieces of this system are available [Mol74, Gil74, Kam74] but the conditions of
testing vary in quality.

4.2.4  MIT  Lincoln  Labs 

The  speech  understanding  system  developed  at  MIT  Lincoln  Laboratories  seems,  at 
this  time,  to  have  the  most  sophisticated  acoustic-phonetic  analysis  of  the  systems 
mentioned  thus  far.  [For74,  Wie74]  It  is  a bottom-up  system  with  phonetic  segments  and 
labels  being  produced  from  the  acoustic  parameters  without  help  from  higher  level 
knowledge  sources.  Essentially,  segmentation  is  performed  and  formant  tracking  proceeds 
first.  Pitch  and  frication  detection  is  also  done.  Then  a broad  assignment  of  [Vowel-like, 
Dip,  Fricative,  Stop]  is  made,  and  specialized  identification  routines  are  applied.  In  the 
case of Vowel-like segments, some further segmentation of semivowels and other voiced
consonants is performed. A final stage merges and edits the various decisions made thus
far.

The results of labeling vowels are 41% and 69% for first choice and first three
choices, respectively. Dips were recognized with 82%, and fricatives with 91% accuracy. It
must be noted that these classes were rather broad. A single class each represented
nasals, glides, and liquids. There was, however, careful attention given to testing
conditions, the data was described in the reports, and one is inclined to believe that the
above  results  are  reliable  estimates  of  the  performance. 

† As more details of the testing were reported, these results may be more reliable than
some  others. 

4.2.5  IBM  Research  --  GLODIS  with  Speech  Knowledge

The  IBM  Speech  Processing  Group,  using  the  General  Language-Oriented  Decision 
Implementation  System  (GLODIS),  has  implemented  a number  of  heuristic  rules  for  phoneme 
identification. [Dix75a] The inputs to this system are a digital spectrum (from 10 kHz.
digitized data), an energy measure, a spectral change measure, and the five best results
of a spectral match function (from a set of 30 to 40 phonetic targets). After application of
the phonetic, phonological, and prosodic rules, overall recognition accuracy of 61.7% is
achieved. In broad classes, accuracy is 88.6%. Segmentation results are also reported: 6.9%
missing; 10.5% extra.

A second  stage  has  been  added  to  this  system,  consisting  of  a dictionary,  statistical 
language model, and probabilistic match. Sentence level accuracy is reported as 85% and
word level, 98%. 8.5 minutes of speech, consisting of 6175 segments, were analyzed.
[Dix75b] 

4.3  Other  Models  and  Systems 

4.3.1  Dragon  --  Hidden  Markov  Process 

In  discussing  the  Dragon  system  [BakJK75a],  we  would  like  to  make  the  following 
observation.  In  a system  capable  of  utilizing  all  the  results  of  the  acoustic  parameter 
recognition,  raw  labeling  accuracy  may  appear  to  be  very  poor,  yet  the  labeling  routines 
can  support  extremely  accurate  recognition  of  higher  level  elements  provided  the 
"correct"  labels  are  available.  The  higher  levels  must  also  understand  the  set  of  labels. 
They  must  be  able  to  use  the  results  of  acoustic  recognition  optimally. 

The recognition statistics presented later for the BAK distance metric (which uses
ZCC parameters that have been amplitude normalized) represent the primitive decision
rule performance in Dragon. 88 templates for 33 phone-like labels were found using the
clustering algorithm described in chapter 7, and Dragon was run on a separate set of test
utterances. Very poor results were obtained. However, it was observed that a single
critical error in one word could abort the entire correct utterance, because of Dragon's
Markov chain representation and certain unconstrained aspects of the grammar. It was
imperative, therefore, that the lexicon (in which all the phonological knowledge is encoded)
include all reasonable realizations of each word so as not to be a source of a critical
rejection. This was accomplished by studying the low level recognition stream of words in
their occurrences in the training data if the word caused a problem in the test data.
Essentially, the lexicon was being trained, or rather, word-specific phonological knowledge
was being acquired. Although only the training data was used to develop this knowledge,
the results on separate test sentences from a fairly unconstrained grammar and moderate
sized lexicon (250 words) exceeded the 95% word recognition level. One speaker was
used for all these sentences.

This experiment serves as an existence proof, then, that correct machine recognition
of  continuous  speech  can  be  based  upon  what  researchers  have  traditionally  considered 
low performance at the acoustic-phonetic level. It also points out the need to include
aspects  other  than  first  choice  statistics  in  any  analysis  of  labeling  performance. 
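
One statistic beyond first choice, used by several of the systems surveyed above, is accuracy within the first n ranked choices. A minimal sketch follows; the ranked label lists and the correct labels are invented for illustration and do not come from any of the systems discussed.

```python
def top_n_accuracy(ranked_choices, truth, n):
    """Fraction of segments whose correct label is among the first n choices."""
    hits = sum(1 for choices, correct in zip(ranked_choices, truth)
               if correct in choices[:n])
    return hits / len(truth)

# Hypothetical ranked labeler output for four vowel segments, best choice first.
ranked = [
    ["iy", "ih", "eh"],
    ["ae", "eh", "ah"],
    ["uw", "uh", "ow"],
    ["aa", "ao", "ah"],
]
truth = ["iy", "eh", "ow", "er"]

first_choice = top_n_accuracy(ranked, truth, 1)
three_choices = top_n_accuracy(ranked, truth, 3)
print(first_choice, three_choices)
```

A labeler that looks poor on first choice alone may still place the correct label high enough in its ranking for a higher level to recover it, which is the point made by the Dragon experiment.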

4.3.2  Dynamic  programming 

Dragon  has  no  separate  segmentation  process;  rather,  the  probability  of  each  word 
in every time interval is carried through the internal representation. A similar,
non-segmenting approach used by Itakura for isolated word recognition [Ita75, Ich73] is the
Dynamic  Programming  model.  In  this  model,  stored  parametric  representations  of  an  entire 
word  or  phrase  are  compared  against  the  input.  Time  justification  is  accomplished  by  a 
dynamic  warping  of  time,  within  certain  limits,  so  as  to  optimize  an  overall  pattern  match 
score.  A primitive  parameter  pattern  match  rule  is  still  needed,  to  be  applied  at  regular 
intervals  (15  ms.)  and  Itakura  introduces  the  minimum  prediction  residual  --  the  log 
likelihood  ratio  of  one  interval  of  the  signal  being  predicted  by  the  LPCs  derived  from  the 
corresponding  template  interval. 
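
The time-warping idea can be sketched with the textbook dynamic programming recurrence. This is a generic illustration, not Itakura's program: the simple band limit on the warp stands in for the "certain limits" mentioned above, his actual path constraints are not reproduced, and the absolute-difference frame distance on scalar frames is a placeholder for the minimum prediction residual. All names and data are hypothetical.

```python
import math

def dtw_distance(template, observed, frame_dist, window):
    """Minimum accumulated frame distance over monotonic time alignments.

    `window` bounds how far the alignment may stray from the diagonal; it is
    widened when needed so that sequences of unequal length can be aligned.
    """
    n, m = len(template), len(observed)
    window = max(window, abs(n - m))
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - window), min(m, i + window) + 1):
            d = frame_dist(template[i - 1], observed[j - 1])
            # extend the cheapest of: repeat template frame, repeat input frame,
            # or advance both
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

def recognize(templates, observed, frame_dist, window=2):
    """Name of the stored template with the best warped match score."""
    return min(templates,
               key=lambda name: dtw_distance(templates[name], observed,
                                             frame_dist, window))

# Hypothetical one-parameter "words"; the input is a time-stretched copy of "a".
templates = {"a": [1.0, 2.0, 3.0], "b": [3.0, 2.0, 1.0]}
observed = [1.0, 1.0, 2.0, 3.0, 3.0]
frame_dist = lambda x, y: abs(x - y)

score_a = dtw_distance(templates["a"], observed, frame_dist, window=2)
best = recognize(templates, observed, frame_dist)
print(score_a, best)
```

The stretched input matches template "a" with zero accumulated distance even though the two sequences differ in length, which is what the time justification is meant to achieve.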

The results of extensive experimentation with this system were 97.3% correct
recognition (1.65% rejection) over 2000 test utterances compared against 200 template
utterances. The utterances were Japanese geographical names of 0.6 seconds average
duration. [Ita75]

We mention this program even though it differs in many ways from the task and
structure of the continuous speech understanding systems. It, like Dragon, demonstrates
the power of a simple, uniform model for applying the results of parametric level
recognition, and it presents an interesting use of the LPC model to directly estimate
similarity of acoustic signals. It is not clear whether this approach could be extended to
continuous speech, or even whether results would be good over a different set of
utterances. But this clearly fills in one more point in the space of current technology
available for certain speech understanding tasks.

4.3.3  Other  Efforts 

Two other efforts deserve at least brief mention. Hess [Hes74] reports a
pitch-synchronous approach to parameter extraction. With pitch synchronous non-harmonic
analysis (apparently similar to some of the LPC methods) he is able to do careful formant
tracking. Segmentation produces alternating steady and transition regions with labeling
accuracy reported at 85% to 90% over a set of 24 labels. (These are results of testing on
the training corpus, about 1700 segments.)

Haskins  Laboratories  [Mer75]  has  been  developing  strategies  for  parametric 
recognition by studying human protocols on certain spectrogram reading tasks. From this
they have developed Phonetic-Context Controlled methods for segmentation and labeling.
The results of human performance for spectrogram reading include location of reference
words in similar contexts and in different contexts and phoneme identification without
reference spectrograms available. [Coo74] While these showed 70% to 80% reference
word identification, and about 70% phoneme identification, correct words were found only
50% of the time in the third experiment. In addition, the language used was English
(presumably the language most familiar to the subjects) and thus use of higher level
knowledge could hardly have been avoided. Indeed, acquiring this knowledge was a large
part of the intent of the experiments.

4.4  Human  Performance 

The implications of human performance to designers of speech understanding
systems  have  often  been  misunderstood.  Shockey  and  Reddy  [Sho74a]  point  out  two 
phenomena  found  in  the  discussions  of  human  performance  and  computer  speech 
recognition.  The  over-expectation  phenomenon  results  in  a tendency  to  expect  too  much 
from  machine  recognition  at  certain  levels  (such  as  the  parametric  analyses  studied  here) 
because  humans  seem  to  do  so  well.  Under-expectation  occurs  because  of  the  poor 
results in the past, and has led to a very great reliance upon lexical, syntactic, and semantic
constraints.  The  suggestion  is  that  a balance  should  be  struck.  At  the  acoustic-phonetic 
level,  one  ought  not  expect  recognition  of  things  that  are  not  there  (acoustically).  Neither 
should  one  forgive  not  recognizing  things  that  are  present  in  the  acoustic  parameters. 
The problem in performance analysis at this level is, therefore, to determine just what is
present.  This  appears  to  be  a strong  plea  for  an  acoustic  approach  to  segmentation  and 
labeling,  leaving  phonology  and  phoneme  recognition  to  other  knowledge  sources.  This 
approach  is  supported  by  the  results  of  the  experiments  with  connected  speech 
transcriptions  of  unfamiliar  languages  reported  by  Shockey  and  Reddy.  These  results  may 
serve  as  another  point  in  the  performance  framework  with  which  one  may  view  the  work 
at  this  level. 

Fifty short utterances in 11 languages were recorded. Ten of the languages were
unfamiliar at all levels (even the phonological level) to the subjects. Correct identification
of  phonetic  elements  was  measured  for  transcriptions  made  by  the  subjects,  who  were 
expert  at  phonetic  transcription  techniques.  The  stimuli  were  auditory  speech, 
spectrograms,  or  waveforms.  Accuracy  of  identification  into  70,  14,  and  5 classes  was 
measured. Auditory input supported results of 56%, 66%, and 78% for the three sets of
classes. Spectrogram and waveform were both very similar at about 24%, 46%, and 67%.
In the light of reported vowel identification results of some speech understanding systems,
it is interesting to note that the computers are doing better than phoneticians working
with spectrogram or waveform data and about as well as auditory perception. When
vowel performance was compressed into 6 overlapping classes: high, mid, low, front,
center, and back the results were about 66%, 4^17, and ^S7. for the three presentation
media,  respectively. 

A different approach to studying human performance on spectral parameters was
taken by Klatt and Stevens [Kla72]. Here spectrograms of unknown English sentences
from an unknown but fairly simple grammar and vocabulary were used in transcription and
word identification experiments. The object was to study the methods of search,
particularly at the lexical lookup interface. Total segment identification performance was
reported at 33% correct, 40% partially correct, 17% in error, and 10% omitted segments.
Vowel and consonant identifications were each similar, with the exception of a much lower
omission rate for vowels. An interesting result of the study of lexical interactions was
that most of the searches initiated did not yield the correct word. However, after
extended interaction (and obviously application of some higher level knowledge), word
recognition was improved to 96%.

There are a large number of other experiments which deal with issues of human
perception of speech, although they are often intended to reveal aspects somewhat
irrelevant to this dissertation, such as the perception of altered speech, superimposed
sounds, context, dialect, speed, or stress. We are concerned here with the much more
basic problem of robust acoustic identification and segmentation. However, human
perception performance does serve as another point in the space of speech understanding
systems, as well as an existence (or, sometimes, non-existence) argument for certain
acoustic-phonetic correlations. Ladefoged [Lad69], in discussing perception of vowel
quality, points out a great deal of ambiguity in the way vowels are perceived and
described by phoneticians and linguists that seems to invalidate much of the detailed
analysis of vowel quality as being descriptive of real physical phenomena. Dealing with
specific acoustic gestures (with whatever phonetic correlates they may possess) is more
reasonable than attempting to do phonetic or phonemic recognition. Structures and rules
such as Markov chains, Sniffers (SOC), dynamic programs, etc., may have little to do with
the kinds of things traditional linguists have studied, yet may be better oriented to
computer understanding of speech. It is not at all unreasonable that, in the context of
different systems (from humans) and different acoustic representations, the optimal
phonetic classes and types of segments may not be the same as those defined by
traditional phonetics.

4.5  Summary 

In this chapter, we have presented some of the results and relevant aspects of a
variety of phone recognition components from a number of current speech systems.
Direct comparisons are very difficult, although some have been made [Hie75]. There is
wide  variety  in  the  types  of  knowledge  used,  the  representational  level  for  output,  and  the 
expectations  and  system  demands  which  characterize  these  recognizers. 

From  these  descriptions  it  may  be  possible  to  form  a picture  of  the  uses  to  which 
parametric level analyses will be put. It is impossible, however, to form an accurate
estimate of the performance expected at the state of the art today. Depending upon
conditions of testing, knowledge sources used, level of representation, etc., the reported
accuracies may vary a great deal. There are a few existence proofs, even so, that may
indicate that the state of the art is approaching a point where general connected speech
understanding will be possible. In a number of limited domains, very high accuracy and
even real-time response may be achieved.

This chapter presents a method for segmenting speech into acoustically uniform
segments.  The  algorithm  is  relatively  independent  of  the  choice  of  parametric 
representation  (except  for  the  use  of  one  amplitude  parameter  specifically  as  such).  The 
basic  operation  is  the  application  of  a distance  metric  to  patterns  adjacent  in  time.  A 
special  metric  has  been  devised  for  this  purpose.  A second  aspect  of  the  method  is  that  it 
employs  a number  of  different  decision  functions. 

5.1  Role  of  Segmentation 

There is a very strong interdependence between the performance of segmentation
and of labeling at the acoustic level and, indeed, at higher levels. The dependence
operates (in one sense) because of the need for labeling to be performed on the "target"
areas of the signal -- those portions of the phonetic gestures' duration in which the
articulators are closest to their intended positions, during which the excitation source is
most stressed and steadiest, and during which coarticulation effects may be minimized. In
the opposite sense, interaction occurs because segmentation is highly dependent upon
context. As simple a cue to segmenting as amplitude or energy in the signal is much less
meaningful during strong fricatives, such as /s/ and /z/, than during most voiced segments.
In the former cases, amplitude may carry no phonetic information at all. In the latter, it
often signals some important change, such as a vowel/glide juncture, and may be the only
robust signal of some pathological cases. Thus the information gained in classifying the
phonetic  context  of  an  acoustic  change  in  the  signal  can  be  very  important  in  determining 
its  relevance  to  phonetic  changes. 

Different  approaches  have  been  taken  in  a number  of  speech  recognition  programs 
to  deal  with  this  interaction.  The  Hearsay  I system  [Erm74b]  attempted  to  do  labeling 
first,  by  placing  acoustic  labels  on  the  signal  at  regular,  short  intervals.  The  string  of 
acoustic labels was then smoothed, and points of change in the nature of the labels were
located. Additional information, such as amplitude and the location of amplitude peaks and
dips, was also used to improve this label-driven segmentation. After changes in the gross
signal  type  (voiced,  fricated,  silence)  were  located,  voiced  segments  were  further 
subdivided  using  the  amplitude  clues. 

Another,  very  different,  approach  [Ita75]  is  to  ignore  segmentation  entirely  as  a 
separable  process.  Rather,  a dynamic  programming  problem  is  postulated  for  solution.  In 
this  problem,  short  time  windows  at  regular  intervals  are  matched  by  means  of  some 
general  matching  function  (labeling),  the  two  intervals  for  each  match  being  taken  from  the 
input  signal  and  from  a stored  template.  Within  certain  constraints,  time  may  be  warped  so 
that different regions of the unknown are matched with different regions of the template
(segmentation). The solution to the dynamic program is the time warping which "best"
aligns  the  two  signals;  the  solution  also  provides  a rating  of  their  match  to  be  compared 
with  those  of  other  templates. 
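
The dynamic programming formulation described above can be illustrated with a minimal sketch. It assumes the plain textbook time-warping recurrence with no slope or endpoint constraints, which is simpler than the constrained matching of [Ita75]; the names `dtw_distance` and `dist` are hypothetical and do not come from any of the systems discussed.

```python
def dtw_distance(unknown, template, dist):
    """Return the cost of the best time alignment of two pattern sequences.

    unknown, template: sequences of patterns sampled at regular intervals.
    dist: any frame-to-frame matching function (the "labeling" part).
    Assumption: unconstrained warping; each step may advance either
    sequence or both, so one frame can match several frames of the other.
    """
    n, m = len(unknown), len(template)
    inf = float("inf")
    # cost[i][j] is the best cost of aligning the first i unknown frames
    # with the first j template frames.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(unknown[i - 1], template[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # unknown advances
                                 cost[i][j - 1],      # template advances
                                 cost[i - 1][j - 1])  # both advance
    return cost[n][m]


def best_template(unknown, templates, dist):
    """Return the key of the stored template with the lowest alignment cost."""
    return min(templates, key=lambda k: dtw_distance(unknown, templates[k], dist))
```

The final cost plays the role of the match rating that is compared across templates; the implied alignment path is the segmentation.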

Both of these methods, as well as other, similar approaches, have been successful in
limited speech understanding systems. However, acoustic level input to a speech
understanding system of more general scope must take another point of view. Crude
segmentation may be sufficient for tasks with small vocabularies and high degrees of
syntactic and semantic support. No segmentation at all is a possible approach in
recognition of short utterances (single words), where the advantages of a uniform
algorithm are not outweighed by an excess of computational and storage requirements
nor by the chance of significantly wrong paths being taken in the "search" for an
appropriate time warping. However, in dealing with continuous speech over weakly
constrained subsets of natural language, the difference between two semantic outputs may
rest upon locating a few glides and nasals as separate entities from the surrounding
voiced context, or may rest in ignoring large intervals of irrelevant signal (silences,
offglides, etc.). In fact, an accurate segmentation is more than half the battle in many
cases, since an accurate account of the number and general nature of the phonetic
elements can greatly reduce higher level searches.

5.2 Present Segmentation Method

For such reasons, we have been investigating the feasibility of segmentation-first
machine  transcription.  In  addition,  such  machine-produced  segments  provide  a realistic 
test  of  labeling  performance  for  the  different  parametric  representations.  In  actual 
practice,  a parametrization  of  the  input  signal  must  support  both  processes  of 
segmentation  and  labeling.  If  critical  errors  are  made  in  either,  the  total  performance  will 
suffer. Because the choice of representation is a very open question, we have attempted
to  develop  algorithms  which  are,  to  a large  degree,  parametrization-independent.  Although 
certain  prosodic  parameters  are  necessarily  assumed  and  relied  upon  within  the 
algorithms,  the  general  approach  has  been  to  develop  segmentation  and  labeling  at  a level 
where  various  representations  are  treated  in  a uniform  manner.  Thus,  the  programs  that 
result  are  also  useful  research  tools  for  this  comparative  study. 

5.2.1 Detecting Change

The basic concept of segmentation is very similar to the well known signal detection
problem. (A more detailed discussion of this model is presented in the next chapter.) In
this case, however, the signal to be detected is "significant" change. Given some function of
time which measures change and which can operate within a small time window on the
input signal, we can postulate two distributions of values for that function: those
correlated with times of acoustic or phonetic change, and those correlated with times of no
relevant change. Unfortunately, the form of these distributions is not available a priori,
although we may not care about their form. So long as a change measuring function is
available which produces significantly different (higher) values at just those times when
changes are occurring in the phonetic state of the signal, the segmentation problem may
be solved. The actual distributions are useful, in signal detection, for choosing optimal
thresholds for the detector. Given the costs associated with type I and type II errors, one
can balance the decision. However, the costs are not really known here, since they may
rest upon much higher level considerations (the phonetic similarity of two semantically
different words, for example). We would hope to learn enough about the change function
from empirical analysis for our purposes.

From  the  previous  discussion,  it  appears  unlikely  that  any  one  measure  of  change 
will  suffice  for  the  segmentation  of  continuous  speech  since  the  changes  are  context 
dependent.  To  look  at  the  problem  from  a pattern  space  point  of  view,  different  regions 
of the pattern space will change in different ways. The intra-class distances of samples of
hi,  for  example,  have  been  found  in  some  parametric  representations  to  be  much  greater 
than  those  of  /n/,  and  they  vary  in  different  elements  of  the  patterns.  We  can  take  this 
view  further  and  rely  upon  a distance  measure  in  the  pattern  space  to  compare 
neighboring  (in  time)  patterns  of  the  input  signal  and  to  rate  the  likelihood  of  a 
phonetically  significant  change.  In  addition  to  merely  rating  the  likelihood  of  change,  we 
wish to locate it in time as accurately as possible. If the resolution with which we look at
the  signal  is  fine  enough,  we  may  assume  that  neighboring  high  values  in  the  distance-of- 
neighbors  function  relate  to  the  same  segment  boundary,  and  we  ought,  thus,  to  choose 
the  time  of  the  highest  value,  the  local  peak. 
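
As a rough illustration of the distance-of-neighbors idea, the sketch below computes a distance between each pattern and its predecessor and then keeps only the local peaks above a threshold. The city-block distance and the function names are assumptions made for the example; the text leaves the choice of distance measure open.

```python
def city_block(a, b):
    """Sum of absolute element differences between two patterns (assumed metric)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def neighbor_distance(patterns, dist=city_block):
    """Distance of each pattern from the one before it; the first frame gets 0."""
    return [0.0] + [dist(patterns[t], patterns[t - 1])
                    for t in range(1, len(patterns))]


def local_peaks(values, threshold):
    """Indices of local maxima above threshold.

    Neighboring high values are assumed to belong to the same boundary,
    so only the peak is kept. On a flat plateau the last frame is chosen.
    """
    peaks = []
    for t in range(1, len(values) - 1):
        if (values[t] > threshold
                and values[t] >= values[t - 1]
                and values[t] > values[t + 1]):
            peaks.append(t)
    return peaks
```

With patterns sampled every 10 ms, each returned index corresponds to a candidate boundary time in units of 10 ms.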

5.2.2  Multiple  Decision  Algorithm 

Since the following discussion goes into the details of the segmentation program, the
reader may wish to skip the rest of this section after looking at figure 5.1 for a general
idea of the method.

The  first  approach  [GolH7A]  was  to  employ  a single  decision  function  which 
combined  measures  of  both  short  and  long  duration  changes.  This  proved  to  be  too 
inflexible.  No  simple  threshold  could  be  found  on  such  a composite  function  to  separate 
change-related from non-change-related peaks. If this were because the distributions of
peaks both at segment boundaries and not at boundaries were identical, then the function
would be useless. However, we found some value in the function if it was employed with
different thresholds, and with varying resolutions in time, during different portions of the
signal. It appeared that the distributions were overlapping significantly because they
were composites of different kinds of phenomena.

Figure 5.1: Segmentation Program

The  current  segmenter,  therefore,  first  attempts  to  locate  points  of  major  change  in 
the  signal  source  and  in  amplitude.  Then  less  robust  functions  are  used  to  segment  within 
these  broadly  separated  regions.  Additionally,  some  correction  rules  are  applied  to  adjust 
certain  cases  where  two  functions  do  not  quite  coincide  in  the  time-placement  of  segment 
boundaries,  but  where  they  are  both  clearly  responding  to  the  same  signal  change.  The 
following  discussion  describes  the  philosophy  of  combining  evidence  from  a number  of 
segmentation  functions,  the  decision  functions  themselves,  and  the  correction  rules.  Then  a 
discussion  is  presented  of  the  training  of  thresholds  and  the  rating  of  segment  boundaries. 

Figure 5.2: Speech/Silence Detection

As the first stage (see figure 5.2), speech is separated from silence segments by
locating those times when the amplitude parameter has dropped below a threshold, T1. If
the amplitude also falls below a second threshold, T2, at some point in the segment, it is
accepted as a silence region. (The acquisition of values for T1 and T2, as well as
thresholds used by other levels, will be discussed later.) The amplitude parameter plays a
special role in the segmenter. Because of limitations in the accuracy of digitization, as well
as inherent shortcomings of the methods, many parametric representations are unreliable
at low amplitudes. It is important that the regions of the signal be located where analysis
can be done with greatest confidence. Moreover, amplitude carries important
segmentation information which ought not to be overlooked when, for reasons of
normalization, it is often removed from the parameters.
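
The two-threshold silence test of this first stage might be sketched as follows. The sketch assumes one amplitude reading per 10 ms frame and that the first threshold lies above the second; the function name `find_silences` and the argument names are hypothetical.

```python
def find_silences(amplitude, t1, t2):
    """Return (start, end) frame ranges accepted as silence; end is exclusive.

    A candidate region is a maximal run of frames with amplitude below t1.
    It is accepted only if the amplitude also falls below t2 somewhere
    inside the run (t2 is assumed to be the lower threshold).
    """
    amplitude = list(amplitude)
    regions = []
    start = None
    # The infinite sentinel closes a candidate run that reaches the end.
    for t, a in enumerate(amplitude + [float("inf")]):
        if a < t1:
            if start is None:
                start = t
        elif start is not None:
            if min(amplitude[start:t]) < t2:
                regions.append((start, t))
            start = None
    return regions
```

Everything outside the returned ranges is treated as speech and passed on to the later decision functions.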

Within the regions identified as silences, a third low threshold, T3, is applied from
either end of the segment to separate onset and offglide regions, or very low amplitude
nasals or fricatives, from true silence. These regions may belong "phonetically" with
neighboring, higher amplitude segments, but have been divided by the application of
threshold T1 from the main part of the speech region. It is a serious question whether
this sort of segmentation is an error for a mechanical segmenter. Our goal in machine
segmentation ought not to be to duplicate hand segmentation of phonemes, but rather to
isolate those locations in the signal which will best support labeling and will provide as
accurate and reliable a map of the acoustic reality of the signal as possible. Higher level
rules in the speech understanding system can then deal much better with the problem of
fitting phonemes to the acoustic labels and segments. Therefore, in isolating the low
amplitudes, offglides, etc., from both speech and silence, we provide for their possible
recognition as separate segments (a final nasal perhaps), and we ensure that labeling of
the speech segments will be performed on higher amplitude signals, where more accurate
classification may be expected. The above premise, that performance of the acoustic level
of speech understanding must be measured at that level and that one cannot expect
certain kinds of recognition behavior from the simple, local algorithms one encounters
there, is very central to our approach in this research. We will meet it again in other
contexts.

There  is  one  other  speech/silence  boundary  phenomenon  which  must  be  dealt  with 
at  an  intermediate  level.  Flaps  will  not,  in  general,  drop  in  amplitude  to  a level  where  the 
silence  detection  described  above  can  detect  them.  However,  the  flap  does  have  a very 
particular  kind  of  amplitude  contour.  (Figure  5.3)  Thus,  in  the  speech  portions,  short 
periods  of  amplitude  below  a threshold  T|  together  with  preceding  drop  Tp  and  succeeding 
rise  Tj  are  isolated  as  flaps.  Only  durations  of  10  or  20  ms.  are  considered.  We  have 
observed  that  flaps  of  longer  duration  are  adequately  detected  by  the  other  functions  of 
the segmenter. Moreover, it is a point of phonological debate when an intervocalic stop
consonant  is  really  a flap.  Thus  the  program  only  labels  as  flaps  such  stops  of  20  ms. 
duration  or  less.  (The  coarseness  of  the  resolution,  10  ms.,  may  allow  stops  of  30  ms.  to 
be  detected.) 

Figure 5.3: Flap Detection by Amplitude Contour

From this point on, a number of detection functions are used. Figure 5.4 displays
some  typical  ones,  as  well  as  the  hand  and  machine  detected  boundaries. 

At the second stage, the speech regions (between silences and/or flaps) are subject
to the first decision function, V_1. This may be any acoustic distance function. In this case
it is a vote-for-change function computed in the following manner: the difference of
successive parameter values, |P_i(t) - P_i(t-1)|, is compared with a threshold R_i, where
i = 1, ..., (number of parameters). If the threshold is exceeded, a vote of 1 is accumulated. The
acquisition of R_i will be discussed later. This function will peak at a time when the
parametric representation is changing in a number of its elements. The local peaks above
the detection threshold for V_1 are considered strong candidates for boundaries of significant
change in signal type.
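
The vote-for-change computation can be written down directly. The following is a minimal sketch in which patterns are lists of parameter values, one list per 10 ms frame; the `lead` and `lag` arguments are an added generalization (an assumption of this sketch) so that the same routine can also compare frames t+1 and t-2.

```python
def vote_for_change(patterns, thresholds, lead=0, lag=1):
    """Count, at each frame t, the parameters whose change exceeds its threshold.

    patterns[t][i] is parameter i at frame t; thresholds[i] is the
    per-parameter vote threshold. With lead=0 and lag=1 the difference is
    |P_i(t) - P_i(t-1)|; with lead=1 and lag=2 it is |P_i(t+1) - P_i(t-2)|.
    Frames too close to either end to be compared receive 0 votes.
    """
    votes = [0] * len(patterns)
    for t in range(lag, len(patterns) - lead):
        votes[t] = sum(
            1
            for new, old, r in zip(patterns[t + lead], patterns[t - lag], thresholds)
            if abs(new - old) > r
        )
    return votes
```

Local peaks of the returned vote sequence that exceed a detection threshold are then the boundary candidates.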

At this point a third decision function, V_2, is applied. This is the magnitude of the
change in the amplitude parameter between t+1 and t-2. This larger duration measure
was adopted because the object of this level is to find fairly major boundaries between
voiced segments, such as vowel/nasal junctions and even certain vowel/stop boundaries
which escape the previous function because the patterns are similar in overall type. It
was  found  that  using  shorter  duration  amplitude  changes  introduced  too  many  spurious 
decisions  while  a longer  span  would  confuse  amplitude  changes  of  a gradual  nature  with 
shifts  that  signal  phonetic  change. 

By now, boundaries of three different types have been found: major silence/speech
junctures, points of major change in signal types, and significant changes of amplitude
within speech segments. The final decision level deals with slow changing boundaries of
the kind encountered in vowel/glide junctures or in diphthongs.

Figure 5.4: Some Detection Functions (and Amplitude)

By  the  time  segmentation  has  proceeded  to  this  fourth  function,  we  may  assume  that 
all  the  obvious  boundaries  have  been  detected.  Therefore  one  ought  to  take  a general 
look  at  the  input  patterns  in  order  to  detect  changes  which  persist  for  a length  of  time.  In 
addition,  some  errors  of  omission  from  higher  levels  may  be  corrected  here.  Such  a 
situation  as  an  /I//z/  boundary,  where,  because  of  the  high  third  formant  of  the  /I/  and 
the  voicing  of  the  /z/,  as  well  as  similar  amplitude  levels,  an  obviously  important  boundary 
is  missed,  will  be  corrected  by  a function  that  is  more  sensitive  to  the  acoustic  change  in 
question. Consequently, a difference threshold, S_i, is applied to each of the pattern
channels and a vote sum, V_3, is taken similarly to that of level two.

The differences above are taken between the input patterns at times t+1 and t-2
for reasons similar to those concerning the amplitude difference function V_2. This function,
V_3, is intended to detect slow changes in the signal; the windowing gives a 30 ms. span to
the observation of change.

There  are  a number  of  situations  where  two  or  more  of  these  functions  respond  to 
the  same  change  phenomenon  yet  locate  it  at  slightly  different  times.  This  may  occur 
because  the  different  functions  are  sensitive  to  different  portions  of  the  pattern,  or 
because  of  scaling  or  windowing  considerations  of  the  particular  parametric 
representation.  Therefore,  after  the  various  levels  mark  the  boundaries,  correction  rules 
are applied. For example, speech/silence (level one) boundaries may be corrected 10 ms.
to the location of level two boundaries in situations where speech is going to silence. (In
speech onset cases, one cannot afford to miss short burst segments by correcting this
way, but in off-glide cases, there are no such short segments.) V_2 and V_3 boundaries,
being responses to phenomena spread over 40 ms., may differ by 20 ms. and be corrected
to the point in this region where a general difference vote function is highest. These
correction rules have been developed in an empirical manner. However, we have
attempted to maintain a sense of justification for any such rule. The limitations of a 10 ms.
time-resolution  also  justify  the  assumption  that  errors  will  consistently  be  made  in  the 
time  location  of  more  gradually  changing  boundaries. 
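
One of the correction rules above, merging two nearby boundaries to the frame where a general vote function is highest, might look like the sketch below. It is only one plausible reading of the rule: the pairing strategy, the function name, and the default gap of 2 frames (20 ms at 10 ms resolution) are assumptions.

```python
def merge_close_boundaries(first, second, vote, max_gap=2):
    """Merge boundaries from two detectors that respond to the same change.

    first, second: sorted frame indices from two decision functions.
    vote: a general difference vote function indexed by frame.
    A boundary in `first` is paired with the earliest unused boundary in
    `second` within max_gap frames; the pair is replaced by the frame in
    between (inclusive) with the highest vote. Unpaired boundaries are kept.
    """
    merged, used = [], set()
    for a in first:
        partner = next(
            (b for b in second if b not in used and abs(a - b) <= max_gap), None
        )
        if partner is None:
            merged.append(a)
        else:
            used.add(partner)
            lo, hi = min(a, partner), max(a, partner)
            merged.append(max(range(lo, hi + 1), key=vote.__getitem__))
    merged.extend(b for b in second if b not in used)
    return sorted(merged)
```

Ties in the vote are resolved in favor of the earliest frame, which is an arbitrary choice of this sketch.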

Issue  might  be  taken  with  the  use  of  thresholds  in  so  much  of  this  process.  One  can 
never  rely  completely  upon  thresholds  to  perform  under  varying  conditions.  However  the 
data  to  be  used  in  this  study  has  been  accumulated  and  analyzed  under  fairly  consistent 
conditions,  with  regard  to  noise,  gain,  and  microphone  characteristics.  This  assumption  of 
consistent  data  should  not  affect  the  more  general  applicability  of  these  methods,  since 
amplitude  and  spectral  normalization  techniques  are  fairly  commonly  available.  For 
example,  Itakura  gives  a method  of  normalizing  the  long-term  spectra  which  essentially 
models the transducer characteristics with a two-variable linear equation [Ita76].

5.2.3  Training 

At this point we should discuss the problem of training, that is, acquiring values for
the various thresholds and weights mentioned above. It is important to the goal of this
comparison that these values be acquired in a uniform fashion for all parametric
representations. Uniformity was equally important in order to maintain some detachment
from the initial test data upon which the segmentation program was developed.

The functions V_1 and V_3 depend upon vectors of thresholds. These have been
derived empirically from a corpus of training data which is segmented and labeled by hand.
It consists of 27 separate utterances of continuous speech of about 3 or 4 seconds each.
Given a parametric representation, P_i(t), where time, t, advances one unit each 10 ms., all
the times at which hand segmented boundaries occur (within the 10 ms. resolution) are
considered. At each such time, t, the differences d_i^(1) = |P_i(t) - P_i(t-1)| and
d_i^(2) = |P_i(t+1) - P_i(t-2)| are calculated. The greatest d_i^(1) and d_i^(2) are then
collected, and the threshold R_i is assigned the least such d_i^(1) and S_i the least such
d_i^(2). Although a large amount of data may be thrown away, this selection ensures that
only the most significant parameters in the representation are given low vote thresholds.
Clearly, however, this method is sensitive to any errors in the hand segmentation, and
great care has been taken with the training corpus especially.
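
The description of this training step is terse, so the sketch below should be taken as one possible reading rather than the procedure actually used: at each hand-marked boundary only the parameter showing the greatest difference is retained, and each parameter's threshold is the least difference retained for it. Parameters that are never the greatest keep an infinite threshold and therefore never vote. All names are hypothetical.

```python
def train_vote_thresholds(patterns, boundaries, lead=0, lag=1):
    """Derive per-parameter vote thresholds from hand-marked boundary frames.

    Assumed reading of the text: at each boundary frame t, compute
    |P_i(t+lead) - P_i(t-lag)| for every parameter i, keep only the
    greatest of these, and set each parameter's threshold to the least
    value kept for it over the whole corpus. lead=0, lag=1 gives the
    short-span thresholds; lead=1, lag=2 gives the longer-span ones.
    """
    n_params = len(patterns[0])
    thresholds = [float("inf")] * n_params
    for t in boundaries:
        if t - lag < 0 or t + lead >= len(patterns):
            continue  # boundary too close to the edge to measure
        diffs = [abs(a - b) for a, b in zip(patterns[t + lead], patterns[t - lag])]
        i = max(range(n_params), key=diffs.__getitem__)  # greatest difference
        thresholds[i] = min(thresholds[i], diffs[i])     # least such value
    return thresholds
```

Under this reading most of the measured differences are discarded, which matches the remark that a large amount of data may be thrown away.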

The calculation of V_1, V_2, and V_3 is thus dependent upon the training corpus only.
The next step is to describe a method for finding the detection thresholds, one for each of
V_1, V_2, and V_3, used to accept or reject peaks in these functions as signifying segment
boundaries. Signal detection theory can help us derive optimal detection thresholds from
statistical analysis of the stimuli. A typical detection model postulates a univariate measure,
L, of the stimulus that relates to the two events, signal-plus-noise, S, and noise-only, N.
Very often, a convenient measure is the ratio Pr{S|input}/Pr{N|input}. It is further assumed
that the distributions of L|S and L|N are distinct, very often identical except for different
means. In standard studies of signal detection, L cannot be accessed. Rather, the forms of the
distributions are assumed, and the actual performance of the detector is used to measure
the difference in distribution means. The basic model shows that performance, as
measured by both Pr{detection|S} and Pr{detection|N}, can be optimized for any set of
costs. As we shall see in the following chapter, when we discuss the signal detection
model further, the choice of threshold is really an arbitrary tuning device. The choice of
decision function has the primary effect on performance.

In  the  segmentation  program  described  above,  the  signal  (boundary)  measure  is 
known,  and  we  may  collect  distributions  empirically.  Therefore  histograms  were  collected 
over a corpus of training utterances to estimate the distributions Pr{L|S} and Pr{L|N}.
Since the measures V_1, V_2, and V_3 are coarse in time, a peak was considered to correlate
with a boundary (S) if it was within 10 ms. of the hand marked boundary. The histograms
of local peak values within 10 ms. of segmental boundaries estimate Pr{L|S} for L = V_1, V_2,
or V_3.

At this point, we had to decide what costs to assign to errors and what confidence
to place in the histograms. The distributions all had the same approximate shape, although
somewhat different apparent variances and different means. Since the means of the
populations could be estimated with the most confidence, the thresholds were chosen to
be halfway between the two distribution means. This corresponds to the model with equal
variances, same distribution, and equal costs, a simplification of the observed situation.
Figure 5.5 shows some empirical distributions for the SPG representation and a
possible threshold value for one of the three decision functions.
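
The halfway rule is simple enough to state in a few lines. The sketch below assumes that the peak values of one decision function have already been sorted into those near a hand-marked boundary and those elsewhere; the function names and the error-rate helper are illustrative additions, not part of the original program.

```python
from statistics import mean


def midpoint_threshold(boundary_peaks, other_peaks):
    """Threshold halfway between the means of the two peak populations."""
    return (mean(boundary_peaks) + mean(other_peaks)) / 2


def error_rates(boundary_peaks, other_peaks, threshold):
    """Return (missing rate, extra rate) for a given threshold.

    A boundary peak at or below the threshold is a missed boundary; a
    non-boundary peak above it is an extra boundary.
    """
    missing = sum(1 for v in boundary_peaks if v <= threshold) / len(boundary_peaks)
    extra = sum(1 for v in other_peaks if v > threshold) / len(other_peaks)
    return missing, extra
```

Moving the threshold away from the midpoint trades missing boundaries against extra ones, which is the tuning decision left to the system designer.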


Figure  5.5:  Choosing  Thresholds 

Finally, the acquisition of the thresholds for silence and flap detection has been
more ad hoc in nature. Amplitude values were collected over the training corpus for
silence segments within utterances (which might be more noisy than inter-utterance
silences). The mean and standard deviation were computed, and T1 (the boundary locating
threshold) is assigned the mean plus the standard deviation. T2 (silence verification) is the
mean, and T3 (low amplitude segment location) is the mean of amplitude readings over
silence, /b/, /d/, and /g/ segments, tending to be slightly higher.

The flap thresholds were chosen by observing the behavior of the amplitude
function in a small number of cases. The flap is a rare enough phenomenon that it is
difficult to collect adequate statistical information about it. Moreover, since flaps are one
specific performance and occur in limited contexts, it is unlikely that the observed patterns
were not representative.

5.3 Summary

We  presented  the  outline  of  a segmentation  procedure.  The  details  provided  are  of 
interest  only  to  those  who  may  be  involved  with  speech  segmentation.  However,  the 
general  scheme  should  be  of  interest  to  others.  Speech  may  be  separated  into  broad  class 
regions  first  --  silence  and  speech,  or  if  voicing  can  be  detected,  silence,  sonorant,  and 
fricated speech. These regions require different types of segmentation activity. However,
most of that activity can be based upon a simple detection process with very good results.
We have introduced some rather ad hoc methods of training, which still work well. One
reason for this is that the choice of detection thresholds is guided as much by the
requirements of the speech system as regards the missing-extra trade-off. Thus, the
tuning of thresholds is a job for the system designer.
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Chapter  6 

Segmentation  Performance 

In  this  chapter,  we  will  deal  with  some  of  the  problems  involved  In  evaluating 
segmentation  accuracy.  The  difficulty  of  obtaining  a correct  standard  for  comparison  with 
If'e  machine  determined  boundaries,  the  nature  of  various  types  of  discrepancies  between 
the  machine  boundaries  and  the  standard,  and  a normalized  measure  of  accuracy  of 
segment  detection  which  is  derived  from  s'gnal  detection  theory  will  all  be  discussed.  We 
will  also  present  the  results  of  segn'*  ting  continuous  speech  employing  the  four 
parametric  representations  chosen  and  described  i.priier,  ZCC,  SPG,  ASA,  and  ACS  . 
Finally,  some  experiences  with  the  use  of  the  segmentation  algorithms,  as  well  as  case 
analyses  of  particular  interesting  "errors"  are  included. 

6.1 Evaluating Segmentation Errors

Errors in segmentation of continuous speech must be considered in the light of
reasonable expectations. If a segment boundary is actually indicated in the acoustic input,
then it ought to be detected. In a like manner, indications of boundary-like change that do
not correspond to phonetic boundaries should be accepted as legitimate results of
segmenting without higher level knowledge sources. Alternatively, boundaries that are not
indicated by some change in the acoustic signal (such as the nasal/nasal juncture in /some
milk/) should obviously not be expected from a segmentation procedure which only
examines the acoustic parameters. Another, equally important, consideration in evaluating
a segmenter is the effect of its errors on the overall speech understanding system. Some
speech understanders are more sensitive to missing segments (boundaries) while others
will handle these well, but become overloaded if too many extra segments are "detected".

If the above statements are taken as a definition (or description) of the kind of
discrepancy between standard and test transcriptions which we wish to consider an error,
then a problem arises with respect to the usual structure of hand transcriptions of
continuous speech. The transcription process is very often one of successive refinement
and, consequently, descent through the "levels" of linguistic elements. First the words are
considered, and approximately fit to strong features in the signal. Then an attempt is
made to fit the standard phonemic spellings of each word, with corrections being made
whenever phonological or phonetic interactions are detected. Finally, if the transcription
is at a sufficiently fine level, the short-term acoustic nature of the signal is used to infer
sub-phonemic segments (sometimes due to co-articulation effects) as well as phenomena
which might not be explained by accepted models of speaker or language, but which can
be justified by the acoustic data. The transcription may, therefore, have segment
boundaries which are justified by strong or weak acoustic cues, by phonemic expectation,
by phonological or phonetic rules, or by any combination of these factors. Thus, when the
machine segmentation misses such a boundary, we must determine what kind of a
boundary it is to determine whether an error has been made. Likewise, when we have
marked an extra segment error, we must be sure that there is really no justification for a
boundary at that point in the utterance. We are often not in control of the source of the
hand transcriptions used for evaluation standards. These may also be used for other
purposes, and, since they are costly to acquire, may have to include the kinds of
information just mentioned.

In a paper describing their segmentation and classification evaluation system [Sil75],
Silverman and Dixon discuss the difficulty of acquiring standard (referent) transcriptions of
continuous speech. This difficulty is compounded by having different sets of
phonemic/phonetic elements. For example: /ch/ versus /t//sh/, /t/ versus /-//t<burst>/,
or /el/ versus /e//I/. Their philosophy appears to be similar to ours in that they consider
sub-phonemic segments when evaluating, but only insist that phonemic segments not be
missed. They also collect statistics on missed and extra segments, with the addition of
separate statistics on misplaced boundaries. These are defined by specifying alternate
transition and steady state regions and declaring segments to be properly placed if they
fall within the appropriate region. Segmentation errors can be reported individually for
each phonetic label.

In her dissertation involving a new acoustic analysis method [BakJM75], Baker
evaluates the performance of five segmentation programs on a small set of sentences (5
sentences -- about 200 segments). Her evaluation was performed by hand, and extremely
careful measurements were made of the mislocations in time for various classes of speech
sounds, as well as the usual missing and extra counts. Her philosophy concerning what
boundaries are important agrees very well with ours, as well as Dixon and Silverman's.
Thus we may compare all the segmentation results from these two sources with our own
and each other. The following caveat must be offered, however. A number of the routines
tested by Baker were in the early stages of development. In addition, the quantities of
data are too small to draw any far reaching conclusions. Finally, some way of normalizing
for different decision criteria, which lead to large amounts of trade-off between extra and
missing segments, must be applied. The following section on a signal detection model will
provide such a normalization. The Results section of this chapter presents both our own
segmentation performance and what we feel is an accurate interpretation of the other
reported performances.

One way to rectify the problem of acquiring good standard segments is to use two
transcriptions (possibly derived from one another). One should contain all the segments
that might ever be justified (the union of the various descriptive levels). The other should
contain the segments that are both acoustically and phonemically justified (the
intersection). A set of segmentation boundaries, M, is evaluated by comparison with
these two standard transcriptions, H1 and H2, in the following way. (See figure 6.1)

Given a margin of admissible error†, an EXTRA segment boundary is recorded if
there exists a boundary in M at time t but none in either H1 or H2 in the interval
[t-margin, t+margin]. Similarly, a MISSING segment is recorded if there are boundaries in H1
at t1 and H2 at t2, |t1-t2|<margin, and there is none in M in the interval
[min(t1,t2)-margin, max(t1,t2)+margin]. Obviously, when H1=H2 this becomes a straightforward
comparison of two sets of segment boundaries.
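
The comparison rule above can be written out as a short Python sketch. The function name, the representation of boundaries as lists of times in ms, and the pairing of each H1 boundary with its nearest H2 boundary are assumptions made for illustration; only the interval tests themselves come from the text.

```python
def evaluate_boundaries(m, h1, h2, margin=10.0):
    """Return (extra, missing) for machine boundaries m against H1 and H2.

    extra:   machine boundary times with no hand boundary within the margin
    missing: (t1, t2) hand boundary pairs with no machine boundary nearby
    """
    def any_within(times, lo, hi):
        return any(lo <= t <= hi for t in times)

    # EXTRA: a boundary in M at t, none in H1 or H2 in [t-margin, t+margin].
    extra = [t for t in m
             if not any_within(h1, t - margin, t + margin)
             and not any_within(h2, t - margin, t + margin)]

    # MISSING: boundaries in H1 at t1 and H2 at t2 with |t1-t2| < margin,
    # and none in M in [min(t1,t2)-margin, max(t1,t2)+margin].
    missing = []
    for t1 in h1:
        candidates = [t2 for t2 in h2 if abs(t1 - t2) < margin]
        if not candidates:
            continue  # boundary not confirmed by H2, so not required
        t2 = min(candidates, key=lambda t: abs(t - t1))  # ASSUMPTION: nearest
        lo, hi = min(t1, t2) - margin, max(t1, t2) + margin
        if not any_within(m, lo, hi):
            missing.append((t1, t2))
    return extra, missing
```

Note that a boundary present in only one of the two hand transcriptions can never be counted as MISSING, and any hand boundary in either transcription is enough to excuse a nearby machine boundary from being EXTRA.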

This  technique  has  allowed  us  to  use  hand  transcriptions  which  are  actually  rather 
inappropriate  for  segmentation  performance  evaluation.  However,  there  are,  in  any  corpus 
of  continuous  speech,  a number  of  segments  which  may  be  best  described  as  transients. 
Unless  the  hand  transcription  has  some  indication  of  these,  their  detection  by  a machine 
segmenter  will  show  up  as  extra  segment  boundary  errors.  While  many  such  transient 
segments  are  indicated,  a careful  inspection  of  the  extra  segment  errors  discovered  about 
300  additional  segments  in  a corpus  of  about  1000  primary  segments  which  were  not 
originally  marked  in  even  the  careful  hand  segmentation  we  have  available  for  this  corpus. 
The  number  of  EXTRA  segments  is  usually  not  as  critical  to  system  performance  as  the 
number  of  MISSING  segments.  However,  each  serves  at  least  as  a counter-measure  to  the 
other. We will observe, in the following section on signal detection, that the response
characteristics of a subject in a series of detection trials will trade off correct positive
with correct negative responses. Similarly, one can tune the segmentation algorithms to
deliver many more segments, thereby increasing the number of EXTRA and decreasing the
number of MISSING segment boundaries. So the two statistics must be considered in
conjunction to determine the detectability of segment boundaries.

† We are using a margin of 10 ms.

An additional test was added to the evaluation routine described above. This test
checks for pairs of MISSING and EXTRA errors which 1) are close together (e.g. ≤30 ms.)
and 2) have no intervening hand segment boundaries indicated in H1 or H2. These pairs
may be taken together as cases of misplaced boundaries since, if there were any other
significant phonetic change in that region, it would be indicated in the hand segmentations.
This correction to the statistics is made after the entire utterance is examined for missing
and extra boundaries.
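
A minimal sketch of this pairing test follows. It assumes each MISSING error is represented by a single time in ms and that pairs are matched greedily in list order; the function name and the 30 ms default are illustrative choices, not details taken from the evaluation routine itself.

```python
def merge_misplaced(extra, missing, h1, h2, max_gap=30.0):
    """Reclassify nearby MISSING/EXTRA pairs as misplaced boundaries.

    Returns (remaining_extra, remaining_missing, misplaced_pairs), where each
    misplaced pair is (missing_time, extra_time).
    """
    hand = sorted(set(h1) | set(h2))
    remaining_extra = list(extra)
    remaining_missing = []
    misplaced = []
    for tm in missing:
        match = None
        for te in remaining_extra:
            lo, hi = min(tm, te), max(tm, te)
            close = (hi - lo) <= max_gap
            # No hand boundary may lie strictly between the two errors.
            clear = not any(lo < t < hi for t in hand)
            if close and clear:
                match = te
                break
        if match is None:
            remaining_missing.append(tm)
        else:
            remaining_extra.remove(match)
            misplaced.append((tm, match))
    return remaining_extra, remaining_missing, misplaced
```

The function is meant to be run once per utterance, after the missing and extra lists for that utterance are complete.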

A final point concerning evaluation of segmentation is the place occupied by hand
evaluation. We have had to rely, for some of the evaluation fidelity, upon hand inspection
of the waveforms. This has been necessary to determine how critical the errors may be
and whether, in fact, they are errors at all, rather than mistakes in the standard
transcription or cases of non-acoustic boundaries with no acoustic correlates for missing
segment errors, or the reverse for extra segment errors. This is especially important
because of the low percentage of boundaries which are missed, and the consequently
large effect of each incorrect evaluation decision. The MISSING segment errors are typed
as Type 2 if they are critical to recognition or if the acoustic cues are clearly present and
there is a significant phonetic juncture in the area. Type 1 represents less critical
boundaries. Often the lack of a segment may be explained by some reasonable phonetic
theory of the speaker's performance. In other cases the boundary has been detected, but
at a different point in time. Often, with slow transitions, the exact time location of
boundaries is impossible. Type 0 errors are usually cases where the standard
transcription is in error or is not acoustic in nature. These are boundaries which we
cannot expect the segmenter to detect. In the case of extra segment boundaries, we have
marked as Type 1 those boundaries which appear to have no acoustic validity upon
inspecting the waveform. Type 0 EXTRAS are, again, places where the standard
transcription is a non-acoustic description, at best. The cases presented in Appendix S1
may serve to indicate the kind of problems one will encounter in dealing with continuous
speech, no matter how carefully one thinks the standard transcription has been prepared.
They will also give an idea of the performance of the segmentation algorithms which we
are using. A subset of the speech corpus, presented as oscillograms, with the standard
transcription, is included as Appendix S2. Marked on this display are the segmentation
errors for one run (SPG parameters, speaker CC) and their types, as defined in the previous
discussion.

6.2 A Signal Detection Measure

In this section, we will present the basic mathematical model of Signal Detection
Theory and discuss its application to the evaluation of machine segmentation of speech.
The parameter d' is a measure of the inherent detectability of the signal versus noise
which is independent of the costs or a priori likelihoods applied by any decision process.
We have empirical evidence that the model fits the actual performance of our segmentation
algorithms for at least one parametric representation over a wide range of detection
thresholds.

The theory of Signal Detection as formulated by Tanner et al. [Tan64, Llc64] is
primarily applied to detection trials which may be considered similar to the segmentation
process. Detection trials consist of a series of responses to stimuli which may be
composed of noise or of noise and some known signal -- not unlike the decision process
resulting in the placement of a segment boundary based upon local information only. It is
assumed that the a priori likelihoods and costs of various errors are known to a decision
process which senses and possibly transforms the stimulus into some internal signal space
before it yields an optimal† decision on the presence of the signal. The detector's sensory
data is considered, in this model, to be reduced to a single decision parameter. A
reasonable one might be the ratio of the probabilities that the input stimulus was
signal+noise versus noise alone. A simple threshold on this single parameter may be
placed to optimize the expected costs given a priori likelihoods, costs of misses, false
alarms, etc. Figure 6.2 represents such a hypothetical internal decision parameter.

Very simply stated, the model assumes a single decision parameter, x, which may be
any sensory measurement one wishes. The distributions of x values for the two types of
stimuli, signal+noise and noise alone, are assumed to be normal (with equal variance in the
simplest version of the model). Their means differ by d' times the standard deviation.
Rates of "hit" and "false alarm" -- Pr{accept|signal} and Pr{accept|noise} respectively --
are sufficient to determine the least d' for which an optimal decision process can display
the observed rates. When the hit and false alarm rates are plotted against one another
for a number of sets of trials where the detector's acceptance threshold has been altered,
a response operator characteristic (ROC) curve is obtained (see figure 6.3).

† in a decision theoretic sense, given the a priori knowledge about the test

The theory states that the curve is totally determined by d'. When the axes of the
ROC curve are transformed by the inverse function of the Normal distribution function, the
curve is approximately a straight line with slope = sigma(noise)/sigma(signal) and
x-intercept = d'.

This  theory  has  been  most  often  applied  to  detection  trials  to  provide  estimates  of 
the  detectability  of  the  signal  as  it  appears  in  a human  perceiver’s  internal  sensory  signal 
space.  The  estimate  provided  by  the  signal  detection  model  may  then  be  compared  with 
well  known  properties  of  visual  or  auditory  signals  to  provide  a bound  on  the  efficacy  of 
the  perceiver’s  transduction  process  --  the  sensory  channel.  While  the  main  thrust  of  its 
application  is  not  relevant  here,  the  signal  detection  model  and  the  dimensionless  measure 
d' can be used as a normalized measure of segment boundary detection that is relatively
unaffected  by  adjustments  in  the  proportion  of  missing  versus  extra  segment  errors. 
Furthermore,  the  d’  value  once  estimated  may  be  used  to  predict  the  entire  response- 
operator  characteristic. 
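
Under the equal-variance form of the model, d' can be estimated from a single (hit, false alarm) pair as the difference of their inverse-Normal transforms. The sketch below uses that standard formula and further assumes that the hit rate is one minus the missing segment rate and that the extra segment rate plays the role of the false alarm rate; the function names are hypothetical.

```python
from statistics import NormalDist

_z = NormalDist().inv_cdf  # inverse of the standard Normal distribution function


def d_prime(hit_rate, false_alarm_rate):
    """Equal-variance estimate: d' = z(hit) - z(false alarm).

    Both rates must lie strictly between 0 and 1; inv_cdf raises
    statistics.StatisticsError otherwise.
    """
    return _z(hit_rate) - _z(false_alarm_rate)


def d_prime_from_errors(missing_rate, extra_rate):
    """ASSUMPTION: hit = 1 - missing rate, false alarm = extra rate."""
    return d_prime(1.0 - missing_rate, extra_rate)
```

A detector that accepts signal and noise equally often has d' = 0; moving the acceptance threshold changes both rates but, under the model, leaves d' unchanged.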

Following the procedures shown in Egan et al. [Ega64], a series of segmentation runs
was made with the ZCC parametric representation for a set of 40 utterances (TAP) in the
AP news retrieval task domain with one speaker. These runs were to investigate a range
of detection responses by varying the thresholds used internally in the segmentation
algorithm (see Chapter 5). The resultant error rates may be seen in figure 6.4 below
plotted on inverse Normal axes. A linear least-squares fit to the points yielded a slope of
1.000 and a d' value of 2.250. We will, henceforth, assume that the simple model with
equal variances gives a good estimate of the performance of the segmentation algorithm
we are testing. The d' values reported below will be derived from that model.
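
The kind of fit described here can be sketched as follows. This is not the procedure of [Ega64]; it is an ordinary least-squares line through the transformed points, with the axis convention (false alarm rate on x, hit rate on y) and the function name chosen as assumptions. Under that convention the line is z(hit) = slope * (z(fa) + d'), so d' is recovered as the magnitude of the x-intercept.

```python
from statistics import NormalDist, fmean


def fit_roc(points):
    """Fit a line to (false_alarm_rate, hit_rate) pairs on inverse-Normal axes.

    Returns (slope, d_prime). ASSUMPTION: x = z(false alarm), y = z(hit).
    """
    z = NormalDist().inv_cdf
    xs = [z(fa) for fa, _ in points]
    ys = [z(hit) for _, hit in points]
    mx, my = fmean(xs), fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    # y = 0 at x = -intercept / slope; d' is the magnitude of that x-intercept.
    return slope, intercept / slope
```

At least two points with distinct false alarm rates are needed. A fitted slope near 1 is what justifies falling back on the equal-variance model for the remaining experiments.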

Finally, a confidence interval was calculated for the d' statistic, under the assumption
that it is approximately normally distributed for any particular experiment. The 95%
interval for the SPG experiment was ±.14 in d'. We shall see that this is considerably less
than the differences in d' observed among parametric representations.


6.3  Results 

As we mentioned in Chapter 5, the segmentation procedures used here were
developed for use with the Hearsay II speech understanding system. In keeping with the
philosophy of that system -- the separation of knowledge about speech into individual
modules or knowledge sources -- we employ no phonetic or phonological knowledge to
correct segmentation decisions. Indeed, we do not even use the labeling information to
join similarly labeled adjoining segments at this stage. It is difficult, therefore, to compare
our results directly with other segmentation and classification schemas which interact
closely to produce a transcription at the phonetic or phonemic level. What we have done,
however, is to carefully inspect the errors made by our segmenter with the SPG
parameters as the input representation.† Then a new reference segmentation was
produced by adjusting the old one to all of the type 0 errors (those cases where the hand
segmentation was in error, or the machine segmentation agreed with the acoustic signal).

The following table (Figure 6.5) shows the results of segmentation for the 40
sentences from the News Retrieval task (TAP), spoken by CC (male, American).

Figure 6.5: Segmentation Performance -- Different Parametric Representations


† Decisions were made from inspection of the waveforms only. The results of this hand
evaluation are included as Appendix S2.

The first reference segmentation contains 1082 segments, primarily at the phonemic
level of description. The second reference contains corrections to this file, as described
above, to make it more an acoustic description of the corpus. It has 1541 segments. The
number of machine segments reported may be greater than the sum of this number (hand
reported acoustic segments) and the number of extra boundaries. The discrepancy is
merely the result of the way we evaluate segmentation by boundaries. Occasionally, two
machine boundaries will fall close enough to a hand boundary to both be accepted. Such
segments must, therefore, be very short, and are usually transition segments which may
be easily detected at higher levels. The number of missing boundaries (segments), divided
by the number of boundaries which are included in both reference segmentations, is the
missing segment error rate. The number of shifted boundaries is also divided by this
number. The number of extra boundaries is divided by the number of primary segments
(the size of H1 in this case). The extra segment rates in parentheses are those where
division is by the number of acoustic segments (size of H2). Finally, d', as determined by
the missing and extra segment error rates, is calculated from the equal variance model
discussed in the previous section of this chapter.
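
The denominators used for each rate can be made explicit in a small sketch. The function and key names are hypothetical, and any counts passed to it other than the two reference sizes given in the text (1082 and 1541) would be illustrative values, not reported results.

```python
def error_rates(n_missing, n_shifted, n_extra, n_common, n_primary, n_acoustic):
    """Compute the reported error rates from boundary counts.

    n_common:   boundaries included in both reference segmentations
    n_primary:  number of primary segments (size of H1)
    n_acoustic: number of acoustic segments (size of H2)
    """
    return {
        "missing": n_missing / n_common,
        "shifted": n_shifted / n_common,
        "extra": n_extra / n_primary,
        # The parenthesized rate: extra boundaries per acoustic segment.
        "extra_acoustic": n_extra / n_acoustic,
    }
```

Because H2 is the larger transcription, the parenthesized extra rate is always the smaller of the two extra rates for the same run.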

In  d',  we  can  see  a clear  decrease  in  "detectability"  of  segment  boundaries  which 
agrees  well  with  what  we  would  suspect  about  the  information  content  of  the  four 
representations. 

A second set of results was obtained over two data sets for which no such
carefully compiled hand segmentations are available. (See Figure 6.6) They can, however,
be compared with the machine segmentation results just presented, as a demonstration of
the robustness of the algorithms used, and thus the validity of evaluation made with them.
We employed the ZCC representation for this experiment because of its availability,
although any of the parametric representations would have provided just as valid results.
In this evaluation, we used only the primary phonetic level hand segmentation. The results
of comparison with this not necessarily acoustic description of the data will, obviously, be
inferior to those presented above. We will reevaluate the ZCC segmentation from that
experiment in a similar manner.

The additional sets of utterances are drawn from two different task domains with
much less restricted vocabularies and grammars and are spoken by two different male
American speakers. The Allophone (LAL) sentences are 27 general English sentences
designed to contain a wide variety of the commonly occurring allophones of that language
[Sho74b]. The Strain set (BTR) consists of 55 sentences drawn from seven, more restricted
grammars and task domains [BakJK75b]. In the following table, the somewhat poorer
performance shown for BTR is possibly attributable to the different method of hand
segmentation used to produce the reference segmentation. In this case, a variant of the
Dragon speech understander [BakJK75c] was used to fit standard lexical transcriptions to
the signal in a sense defined by the Dragon model to be optimal. These were hand
corrected to some extent, but by a different person than the transcriber of the rest of the
data. It has not been possible at this point to correct this discrepancy. However, our
experience over a wide range of data sets attests to the robustness of the
segmentation procedures. The excellent agreement between LAL and TAP is found in spite
of the fact that thresholds for the segmenter were derived from another corpus, spoken
by the speaker of TAP but consisting of the utterances of LAL.

Figure 6.6: Segmentation Performance -- Other Speakers and Tasks


Finally, in Figure 6.7, comparison can be made between the first set of results and
previously reported segmentation performance as given in Baker and Dixon and Silverman.
[BakJM75, Dix75a]

A note concerning interpretation: a careful inspection of Baker's results showed
that secondary boundaries increase the total number of hand segments to about 370. We
have not generally provided that detailed a hand transcription and, thus, may be reporting
as extra boundaries some legitimate detections of secondary segments. Secondly,
Silverman and Dixon do not report the sources of knowledge used to produce the
segmentation results we have quoted. It is our impression that some amount of label and
phonetic rule information is used to improve the segmentation.

6.4 Discussion

Although the difficulty of really accurate comparison between different segmentation
programs would seem to preclude drawing firm conclusions from the last set of results
concerning their relative merits, it is fairly clear that progress is being made. Certainly
some programs may be better at detecting and locating certain types of boundaries. At
this point, only careful comparison of system errors can provide those insights. However,
the major result we wish to put forward is concerned with the parametric representation
dimension. There is, indeed, a measurable improvement as one goes to more
informationally complete representations of the signal. However, that improvement may
not be so critical to system performance as to justify increased computation or hardware†
costs. If higher level knowledge can effectively cope with 2 or 3 times the number of
extra segments, then we could keep the number of missing segments constant and go from
an LPC/DFT computation for SPG to the six analog filters of ZCC. At the ZCC value for d'
(1.29), a missing segment rate of 4% corresponds to an extra segment rate of 68% (versus
28% for SPG). Whether or not he is willing to make such trade-offs is entirely up to the
system designer.
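
The trade-off quoted above follows directly from the equal-variance model: fixing d' and the missing rate determines the extra rate. The sketch below assumes, as an interpretation of the text, that the hit rate is one minus the missing rate and that the extra rate stands in for the false alarm rate; the function name is hypothetical.

```python
from statistics import NormalDist

_normal = NormalDist()  # standard Normal distribution


def extra_rate_for(d_prime, missing_rate):
    """Extra (false alarm) rate implied by d' at a given missing rate.

    Inverts d' = z(1 - missing) - z(extra) for the extra rate.
    missing_rate must lie strictly between 0 and 1.
    """
    z_hit = _normal.inv_cdf(1.0 - missing_rate)
    return _normal.cdf(z_hit - d_prime)
```

With d' = 1.29 and a missing rate of 0.04, this returns roughly 0.68, in line with the ZCC figures quoted in the text. Sweeping the missing rate at a fixed d' traces out the whole trade-off curve available to the system designer.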


† Computationally expensive but straightforward representations may be handled rapidly
with the use of special purpose hardware at a loss of generality and an increase in initial
costs.

It has been our experience in working with the segmentation algorithms developed
at C-MU that, although improvements in performance have been, and will certainly
continue to be, made, this "snapshot" of the parametric representation dimension is valid
for some time to come. We have yet to see promise of any completely parametric level
solution to the segmentation problem. This should be obvious upon considering the highly
variable nature of continuous speech and the variety of kinds of phenomena to be dealt
with in segmenting any single utterance. Thus UP (Baker) parameters may be effective in
locating short burst types of segments while LPC models of the resonant structure may be
superior for long, voiced segments and sonorant/sonorant boundaries.

In the final analysis, inspection of specific cases gives the kind of qualitative insights
that are also needed to predict performance in a total system. Particular kinds of errors
may be very costly if they lead down wrong paths in the search, but only detailed
understanding of particular systems can identify such cases. In Appendix S1, we have
tried to identify some of the different situations, by presenting cases of discrepancy with
the hand segmentation. Where the hand segmentation is correct, other sources of
knowledge must be able to override the segmenter's mistake. Where the machine segment
best fits the acoustic signal, higher level knowledge must understand such cases of
variation from expectation.

6.5  Summary 

In this chapter, we have described some of the problems encountered in making a
fair evaluation of segmentation. The major problem is acquiring a hand transcription of the
correct level of representation. Our approach is to use two referent segmentations, one at
the acoustic and one at the phonetic level. A model of signal detection, derived from existing
theory, gives a useful measure of detectability which normalizes for missing-extra
trade-off. The resultant evaluations show a clear preference for high information
representations. The SPG and ACS parameters contain a complete model of the resonant
structure of the vocal tract impulse response. The ASA and ZCC are approximations to
this information. The first two contain more than 500 bits of information per window, the
latter two, less than 100. Finally, we have compared the results obtained with our very
low-level segmentation routine to other results reported in the literature. We feel this
routine performs quite satisfactorily. With the improvements of phonetic and phonological
knowledge used by other programs, it should perform at least as well as they do.

Chapter 7

In this chapter, we discuss the labeling procedure adopted for this research. An
important issue to be faced is the choice of segmenting or labeling first, and whether the
two processes will interact. The choice of distance metrics is also a primary one for this
evaluative study. Finally, we introduce some simple parameters which relate to the
segmentation; these are prosodic in nature. Training the labeler is discussed, and an
algorithm for clustering the training data into natural, intra-phone groupings is presented.


7.1  Role  of  Labeler,  Interface  with  Segmenter 

We have observed that labeling and segmenting are often strongly interconnected in
many  systems.  Indeed,  the  two  processes  seem  to  be  two  sides  of  the  same  coin.  Similar 
techniques  are  often  used  to  match  input  patterns  with  stored  templates  as  are  used  to 
match  inputs  from  neighboring  time  intervals  for  boundary  detection.  The  choice  of 
whether  to  segment  or  label  first  is,  to  a large  extent,  an  arbitrary  one,  often  based  more 
upon  system  structures  than  upon  the  requirements  of  acoustic  analysis.  However,  since 
we  have  tried  to  separate  the  two  processes  as  much  as  is  realistically  possible  for  the 
comparative  analyses  made,  it  has  seemed  more  sensible  to  segment  first.  By  associating  a 
number  of  input  patterns  with  each  other  in  a single  segment,  this  strategy  allows  one  to 
reduce the sheer number of input patterns which one must compare against a set of
stored templates. Labeling second also allows one to make use of the segmentation
decisions to locate regions of least acoustic change in the input signal. It is not our
purpose to argue the merits of any particular approach to structuring the application of
these two processes; it is assumed that segmentation proceeds directly upon the acoustic
input. Labeling occurs at those regions of the input which segmentation has selected as
being relatively stationary (e.g., pieces of vowels) or as being primitive acoustic gestures
(e.g., bursts). The recomposition of these segments, with the labels which they will
acquire, into higher level constructs such as stop consonants, diphthongs, or even
phonemes that, although ideally stationary, were realized with internal acoustic variations,
is a task for a different set of algorithms which utilize phonetic and phonological
knowledge. The labeling is intended to be acoustic-phonetic -- it attempts to identify in
the input those acoustic patterns which occur during the realization of the phonetic
elements.

A second aspect of the labeling process has to do with how one measures correct
performance. It is generally considered important for speech understanding systems that
the correct label be "available" among a few alternative choices. Rather than considering
only the first choice, most systems will use other sources of knowledge to choose among a
few labels for each segment. Thus we are not concerned that the labeler may label a
segment with the wrong label, provided it also reports the correct label as an alternative.
This kind of requirement has implications mainly for the training process, which will be
discussed below. A similar way of stating it is to look at the allophonic variation problem.
Very often allophones of one phoneme are acoustically very similar to those of another
phoneme. Although these separate allophones represent the same acoustic state, we might
wish to keep separate descriptions of each, thus guaranteeing that the labeler will
report the "correct" as well as other phonetic labels.

For both reasons, therefore, we have chosen to learn and keep as recognition
templates the acoustic patterns of a number of variants of each phonetic label. The
method of acquisition of the template set and the patterns will be discussed when we
present the clustering method below.

7.2  Choice  of  Metrics 

In  the  survey  of  pattern  classification  techniques,  a number  of  distance  metrics  were 
described,  and  the  obvious  and  central  role  of  the  distance  concept  was  discussed. 
Distance in the pattern space occupies just such a central role not only in the pattern
recognition  but  also  in  the  template  training  methods.  Thus,  we  will  briefly  restate  some 
of  the  features  of  the  set  of  metrics  chosen  for  these  experiments. 


Euclidean distance (EUC) and Correlation (COR) are each functions of two pattern
vectors. Euclidean distance serves to draw spherical loci of equal distance around any
point in the space. The loci of equal distance around a point for Correlation are cones
with the vertex at the origin. Correlation may also be thought of as Euclidean distance in
the two dimensional space of the surface of the sphere around the origin. The decision
boundaries drawn between any two template points are hyperplanes (perpendicular
bisectors of the connecting segment for EUC, through the origin for COR). Although one
could consider Euclidean distance to be much more powerful in capturing relationships in
the pattern space, Correlation does serve to normalize out completely any scalar terms
(such as amplitude from a set of filter band parameters).
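
As an illustration only, the two metrics can be sketched in Python (not the language of the original programs). The names euc and cor are hypothetical, and the sketch assumes that COR is turned into a distance by taking one minus the normalized inner product of the two vectors, a detail the text does not specify.

```python
import math

def euc(x, y):
    """Euclidean distance between two pattern vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def cor(x, y):
    """Correlation distance (assumed form): 1 - cosine of the angle between x and y."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / norm
```

Under this assumed form, scaling a pattern by a positive constant leaves cor unchanged, which is the amplitude normalization described above; euc grows with the scale difference.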

When second moment data about the templates is available, such as variance of the
parameters within each phonetic label class, or covariance (overall or label-specific), more
complex distance metrics give somewhat improved results at the cost of more computation.
Standard Deviation weighted Euclidean distance (SIG) normalizes each term of the
Euclidean distance by the variance in that dimension. Its loci are ellipsoids with axes
parallel to the dimensional axes. Finally, if covariance information is available for each
label class, we can use the quadratic form to draw boundaries of general quadratic
surfaces. This is the Maximum Likelihood metric (LIK), which assumes general Gaussian
distributions of the classes and assigns the input to the class most likely to have produced
it.
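
A minimal sketch of these two metrics follows, under stated assumptions: sig divides each squared difference by the class variance, and lik is the negative log of a Gaussian density with an additive constant dropped. The function names and the use of a Cholesky factorization are illustrative choices, not taken from the original programs.

```python
import math

def sig(x, mean, var):
    """Variance-weighted squared Euclidean distance (axis-parallel ellipsoids)."""
    return sum((a - m) ** 2 / v for a, m, v in zip(x, mean, var))

def cholesky(cov):
    """Lower-triangular L with L * L^T == cov (cov must be positive definite)."""
    n = len(cov)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = cov[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L

def lik(x, mean, cov):
    """Negative log Gaussian likelihood of x for one class, up to a constant."""
    L = cholesky(cov)
    z = []
    for i in range(len(x)):  # forward substitution: solve L z = x - mean
        s = (x[i] - mean[i]) - sum(L[i][k] * z[k] for k in range(i))
        z.append(s / L[i][i])
    quad = sum(v * v for v in z)  # (x - mean)^T cov^-1 (x - mean)
    log_det = 2.0 * sum(math.log(L[i][i]) for i in range(len(x)))
    return 0.5 * (quad + log_det)
```

The input is assigned to the class with the smallest lik value. The log-determinant term is what lets the decision boundaries between two classes become general quadratic surfaces rather than hyperplanes.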

Two other distance metrics have been chosen. The Itakura (ITK) metric is based
upon the linear prediction theory and is the estimate of the least squares error term when
one interval is predicted by the LPC model of another. The motivation for this type of
measure is different from the previously described geometric partitionings of a pattern
space; however, it is included in the investigations in so far as it can be applied within the
algorithms. Finally, the Baker log probability estimate (BAK) is an ad hoc estimate, based
upon the Euclidean distance in a normalized version of the ZCC parameter space, which is
of particular interest because of the relatively good results which have been obtained in
the Dragon system using it (even though the ZCC parameters are fairly crude by current
standards). [BakJK75b, Low76]
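
The idea behind the ITK measure can be sketched as follows, assuming the commonly cited log-ratio form: the residual energy left when the input interval is filtered by the template's LPC inverse filter, divided by the residual left by the input's own best filter. The function names, the autocorrelation-based formulation, and the Levinson-Durbin solver are illustrative assumptions; the thesis does not give its exact computation.

```python
import math

def levinson(r, p):
    """Levinson-Durbin recursion on autocorrelations r[0..p].
    Returns (a, err): inverse filter a = [1, a1, ..., ap] and its residual energy."""
    a = [1.0] + [0.0] * p
    err = r[0]
    for i in range(1, p + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err  # reflection coefficient for order i
        prev = a[:]
        for j in range(1, i):
            a[j] = prev[j] + k * prev[i - j]
        a[i] = k
        err *= 1.0 - k * k
    return a, err

def residual(a, r):
    """Energy left when a signal with autocorrelations r is filtered by a."""
    n = len(a)
    return sum(a[i] * a[j] * r[abs(i - j)] for i in range(n) for j in range(n))

def itakura(r_input, a_template):
    """Log ratio of the template filter's residual to the input's own minimum."""
    _, best = levinson(r_input, len(a_template) - 1)
    return math.log(residual(a_template, r_input) / best)
```

The value is zero when the template filter equals the input's own optimal filter and positive otherwise, so it can be ranked like the geometric distances even though it is not symmetric in its two arguments.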

7.3  Prosodic  Features 

The labeler has available to it the segmentation decisions, and could use these to
some benefit if segmental information is known about the training data. Thus, an additional
aspect of pattern matching or distance computation which has been useful is a comparison
of what may be called the prosodies of the entire segment -- as opposed to the acoustics
of the center. We have chosen three rather simple parameters which measure the broad
nature of each segment:

1) The average amplitude of the signal over the segment;

2) The duration of the segment;

3) The contour of the amplitude as it compares to neighboring segments.

Calculation of the first two is obvious; the last is merely given a value 1, 0, or -1 if the
segment represents a peak, intermediate level, or trough in average amplitude,
respectively. A particularly conservative application of this information is made in
comparing two segments. A scalar multiplier for the regular distance metric score is
composed of the product of three values derived from these three parameters: 1) the
ratio of the two average amplitudes, 2) the ratio of the two durations, and 3) the
difference between the two contours. Each of these is applied to an (ad hoc) function giving
values between 1 and 2. (See Figure 7.1.) Thus the minimum scalar is 1 and the maximum is
8. The distance between input and template will be increased when any of these
parameters disagree.
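
The composition of the scalar can be sketched as below. The weighting curve of Figure 7.1 is not reproduced in this copy, so ratio_weight and contour_weight are hypothetical stand-ins chosen only to satisfy the stated range of 1 to 2; the actual ad hoc function may differ in shape.

```python
import math

def ratio_weight(a, b):
    """Hypothetical weight in [1, 2]: 1 when a == b, 2 at a factor of two or more."""
    return 1.0 + min(1.0, abs(math.log2(a / b)))

def contour_weight(c1, c2):
    """Contours are -1, 0, or 1; their difference (0 to 2) is mapped onto [1, 2]."""
    return 1.0 + abs(c1 - c2) / 2.0

def prosodic_scalar(seg_a, seg_b):
    """Each segment is (average amplitude, duration, contour); result lies in [1, 8]."""
    return (ratio_weight(seg_a[0], seg_b[0])
            * ratio_weight(seg_a[1], seg_b[1])
            * contour_weight(seg_a[2], seg_b[2]))
```

Because the three weights are multiplied, a disagreement in any one parameter raises the final distance, and the scalar never drops below 1, so the prosodic information can only penalize a match.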

While the effect of such prosodic matching is not completely determined when
comparing segments produced by different segmenters (hand versus machine), it has
proven valuable in clustering, where different duration, contour, or stress may be
corollary to allophonic variation.

Figure 7.1: Prosodic Weighting Function


In summary, a segment is labeled by computing the distance between its center
pattern and each of a set of template patterns, sometimes using the prosodic scalar to
penalize the scores of very differently "shaped" segments. The closest templates indicate
the best alternative labels for the segment. Training uses some of the same metrics.
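
The labeling step summarized above can be sketched as a ranking over templates. The function name label_segment, the n_best parameter, and the choice to keep only the closest template of each label are assumptions made for illustration.

```python
def label_segment(pattern, templates, distance, n_best=3):
    """templates: list of (label, template_pattern); several may share a label.
    Returns up to n_best (label, distance) pairs, closest first."""
    best = {}
    for label, template in templates:
        d = distance(pattern, template)
        if label not in best or d < best[label]:
            best[label] = d  # keep the closest variant of each label
    ranked = sorted(best.items(), key=lambda item: item[1])
    return ranked[:n_best]
```

Any of the metrics discussed in this chapter can be passed as distance, optionally multiplied by the prosodic scalar before ranking.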

7.4  Training,  Cluster  Acquisition

In  keeping  with  the  philosophy  expressed  above,  we  have  designed  a training 
procedure  for  label  templates  which  discovers  the  inherent  clustering  of  the  sample 
patterns  in  the  parameter  space.  The  purpose  is  to  identify  the  acoustic  patterns  which 
commonly  occur  during  the  realization  of  each  phone. 

Since  our  only  model  of  acoustics  is  similarity  in  the  pattern  space  of  a particular 
parametric  representation,  the  algorithm  for  clustering  is  based  on  pattern  space  distance 
(and  the  previously  mentioned  segment  oriented  prosodic  scalar). 

A particular corpus of data has been chosen as the source of samples for training.
We have employed for this purpose a set of 27 continuous speech utterances (about 2
minutes) designed to include examples of most common allophones of American English
phonemes in semantically and syntactically correct, yet unusual (and thus care-invoking)
sentences. These training sentences are recorded under similar conditions for each
speaker tested. The approximately 1700 phonetic segments in this corpus have been hand
segmented and labeled to an extreme degree of care and fineness of view. This
segmentation bears a very strong acoustic flavor in that clear acoustic cues, such as
changes in amplitude or formant shifts, are taken to indicate separate sub-phonemic
segments. The transcription may be taken to be as close to an acoustic description of the
signal as can be justified by reasonable phonetic theory and careful listening and study of
waveforms and spectrograms.

For each segment, the center 10 msec of data yield a vector of parameters in the
chosen representation, and the segment boundaries and overall amplitude parameter yield
the three prosodic parameters. Each of 40 phonetic labels is represented by an exclusive
subset of these 1700 input samples. For each subset, all samples are read in and the
complete set of pairwise distances and prosodic scalars is computed. The resultant matrix
of distance (x scalar) values is then reordered in the following fashion:

First a threshold is chosen by computing the mean and standard deviation of all the
entries in the matrix and setting the threshold equal to MEAN + C x STANDEV (where C is
usually -1/2). This data-determined threshold is then applied to each row of the matrix,
and the row with the most elements within the threshold is selected as indicating the first
template. (Ties are resolved by the least sum of all below-threshold elements.) Those
samples within the threshold distance of this identified sample are removed from the set.
Then we iterate, producing a second, third, etc., template for the particular label, until the
entire set is exhausted. (See Figure 7.2 for the resultant matrix of pairwise distances for
89 samples of schwa.) The number of samples supporting each template (the number of
remaining samples within threshold) is used to discard "errors" or unusual realizations by
discarding templates supported by fewer than k samples (usually 2, 3, or 4 in these
investigations).
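
The procedure just described can be sketched in Python as follows. This is an illustrative reading, not the original program: the name cluster_templates is hypothetical, the statistics are taken over every matrix entry including the zero diagonal, and a template's support is assumed to count the template sample itself.

```python
import statistics

def cluster_templates(dist, c=-0.5, k=2):
    """dist: symmetric matrix of pairwise (scaled) distances for one label.
    Returns a list of (template_index, member_indices) for the templates kept."""
    n = len(dist)
    entries = [d for row in dist for d in row]
    threshold = statistics.mean(entries) + c * statistics.pstdev(entries)
    remaining = set(range(n))
    templates = []
    while remaining:
        def support(i):
            near = [j for j in remaining if j != i and dist[i][j] <= threshold]
            # More neighbours first; ties go to the smaller summed distance.
            return (-len(near), sum(dist[i][j] for j in near), i)
        center = min(remaining, key=support)
        members = {center} | {j for j in remaining if dist[center][j] <= threshold}
        remaining -= members  # the chosen sample always leaves the set
        if len(members) >= k:  # poorly supported templates are discarded
            templates.append((center, sorted(members)))
    return templates
```

For six one-dimensional samples at 0.0, 0.1, 0.2, 5.0, 5.1, and 9.0 with absolute difference as the distance, the sketch yields one template for the first three samples and one for the next two, and drops the isolated sample at 9.0 when k is 2.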

Some interesting results have come out of applying this simple algorithm, in addition
to the particular sets of templates. 1) Errors in the hand segmentation and labeling were
clearly identified as poorly supported (often by no other sample) samples of the (errorful)
label. 2) While some clearly allophonic distinctions were made, such as de-voicing in /r/
(as in "crude"), often clusters had greater correlations with such factors as position within
a word (i.e., stress). We consider all variants of a label which occur frequently enough to
be useful sources of templates. 3) Finally, even though considerably different parametric
representations and distance metrics were used in the clustering algorithm, very small
variation was discovered in the total number of templates, and a fair degree of consistency
was found in the number of templates for each label. This seems to indicate that the
inherent acoustic variations are indeed being discovered.

Figure 7.2: Clustering Matrix -- Sorted by Cluster

The following chapter discusses the evaluation of labeling performance and presents
the results for a number of parametrization/metric combinations. The most important
performance issue is accuracy. However, some attention is due to requirements of storage
and computation time. The single most important number for these aspects of performance
is the number of parameters in the representation. Storage of templates is linear in this
number (except for covariance data, which rises as its square). Computation of EUC, COR,
and SIG is linear, and of LIK is as the square of the number of parameters. Of similar
importance is the number of templates. All storage and computation requirements increase
linearly with this number. Speed-ups may be effected by methods such as partial
evaluation of the distance metric and discarding of a choice if the partial evaluations
exceed some threshold. Or, if the distance to one vowel template is too great, no other
vowels might be evaluated. A number of ways are available of speeding up the very
time-consuming process of evaluating the full set of distances from the input to each
template for every segment. The trade-off between a smaller number of templates yielding
a less fine partitioning of the pattern space, and more templates with consequently higher
storage and computational costs, is one which must be considered in the light of what the
rest of the speech understanding system expects from the acoustic-phonetic level analysis.
A small number of templates will be less costly, and indeed, will be "correct" a higher
percentage of the time. Yet the information provided by these fewer templates will be
less constraining to the search for the utterance precisely because the classes recognized
are broader.
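
The partial evaluation speed-up mentioned in this section can be sketched for a squared Euclidean distance. The function name is hypothetical, and the sketch assumes the abandonment threshold is simply the best complete distance found so far; the text leaves the threshold unspecified.

```python
def nearest_template(pattern, templates):
    """templates: list of (label, template_pattern).
    Returns (label, squared distance) of the closest template."""
    best_label, best = None, float("inf")
    for label, template in templates:
        total = 0.0
        for a, b in zip(pattern, template):
            total += (a - b) ** 2
            if total >= best:  # partial sum already too large: abandon this template
                break
        else:  # loop finished without abandoning: this template is the new best
            best_label, best = label, total
    return best_label, best
```

Since every term of the sum is non-negative, abandoning early never changes which template is returned; it only saves the remaining terms for templates that cannot win.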
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7.5  Summary 

The simple pattern classification model is chosen, and a set of basic distance metrics
is applied to it. The distances are augmented by a prosodic weight function, composed of
three simple measurements of stress and duration of each segment. The less in agreement
these parameters are, between the template and the unknown pattern, the greater the
distance value will be. Training of the labeler includes discovering the inherent clusters
within each phone class. This process is, again, based upon the distance metric and
pattern space measure of acoustic similarity central to this level.




Chapter  8 

Labeling  Performance 

In order to properly evaluate the labeling performance results, we must first
consider some of the issues associated with the labeling process as we have defined it.
The following section deals with the problems of deciding 1) what are the objects we are
to recognize, 2) what is error and correctness in an ambiguous situation, and 3) what is
the effect of segmentation performance on the definition of labeling correctness. Section
2 defines the types of statistics we will present and the experimental dimensions we will
cover.

8.1  Some  Issues  for  Evaluation

8.1.1  Recognition  Targets 

In Section 1 of Chapter 3, we introduced the need to have acoustic-phonetic
elements as recognition targets. We decided that, although recognizing phonetic features
was also a valid approach to the problem of constructing a phonetic description of the
signal, using individual target sounds (hopefully with associated phonetic information)
would be more likely to provide the robustness needed by continuous speech
understanding systems. In Chapter 7, we presented a method of cluster analysis which
could derive these acoustic-phonetic labels as acoustic clusters and representative
templates with hand supplied phonetic labels.

A second issue raised at that time pertains to the size of the set of recognition
targets -- the fineness of the partitioning of the phonetic space. While we do not deny
the existence of such entities as phonemes within the domain of higher level speech
knowledge, we have reached the conclusion, along with many others [SchR75, SII75,
Erm74b, Red75b], that the acoustic-phonetic elements found in continuous speech belong
to an entire spectrum of such partitionings. That is, to any degree of fineness, there will
always be some ambiguity encountered in labeling (and segmenting) the speech signal, yet,
at any degree of fineness, a valid description of the signal can be given. Thus, in choosing
where to fix our goals for identification of sounds, we must make some arbitrary decisions.
Those decisions may, however, be guided by what we know about the input requirements
of knowledge sources at higher levels and about common practice in describing speech at
this low level.

The  problem  of  evaluating  labeling  performance  depends  upon  a set  of  decisions 
regarding  1)  the  samples  of  the  signal  used  to  test  --  whose  segmentation  to  employ,  2) 
the  recognition  targets  --  how  fine  a description  of  the  acoustic-phonetic  states  of  the 
signal  to  create,  and  3)  the  criteria  for  correctness.  This  last  decision  must  be  made  in 
terms  of  the  system  which  will  eventually  use  the  labels.  Thus  some  discussion  of  the 
sources  and  types  of  error  in  labeling  is  in  order  if  we  are  to  justify  how  we  present  the 
performance  results  and  how  we  define  correct  behavior. 

8.1.2  Errors 

To declare a label in error, we must have at our disposal the "correct"
interpretation  of  the  acoustic  signal  for  that  segment.  Since  such  information  is  usually 
provided  by  hand  segmentation  and  transcription,  we  are  faced  with  the  problems  of  level 
of description and of human error similar to those discussed in Chapter 5 for hand
segmentations. 

It is clear, for example, that we should not expect the output of even a totally
"correct" labeler to match the phonemic content of the utterance, even ignoring
discrepancies introduced by acoustic rather than phonemic segmentation to define the
location of labeling activity. Vowels will be affected by context -- rounding caused by
velarization, centralization caused by lack of stress. The transition from /z/ to /s/ caused
by gradual loss of voicing in final /z/ is another commonly occurring discrepancy.

If there were a clearly accepted set of phonetic rules to explain such deviations
from the standard phonemic expectation (and assuming phonological variation was also
handled, perhaps integrated into the lexicon), an automatic evaluation could be used with
confidence that the errors found were truly errors of the acoustic-phonetic level.

We  are  forced,  rather,  to  rely  upon  a careful  hand  transcription,  where  the 
segmenter/labeler has taken some effort to provide an acoustically valid description of the
signal.  In  spite  of  its  inherent  errors  and  misrepresentations,  such  a transcription  is  the 
best  representation  of  the  acoustic-phonetic  situation  that  is  currently  available.  We  can 
only  try  to  make  up  for  incorrect  reference  labels  by  providing  a few  degrees  of  fineness 
in  the  sets  of  targets  for  which  performance  is  reported. 

8.1.3  Segmentation 

We  have  already  raised  the  problem  of  segmentation  and  its  effect  upon  labeling. 
While  we  cannot  have  a perfect  segmentation  for  reference,  any  more  than  a perfect 
labeling,  we  can  improve  the  correspondence  of  our  reference  segmentation  with  acoustic 
reality by using a finer, corrected, hand segmentation. In many cases, acoustic segments
(with  phonemic  labels  at  times)  will  direct  the  machine  labeler  to  the  relevant  portions  of 
the  signal.  Sonorant  segments  will  be  composed  of  more  or  less  steady  state  sub- 
segments,  which  are  the  places  where  we  can  expect  the  best  labeling  performance.  Their 
locations  are  not  always  available  at  the  phonemic  or  phonetic  level. 

The procedure we have adopted is to acquire the best hand segmentation possible
at the lowest level of description possible. Then a corrected segmentation, such as the
one referred to in Chapter 6 and Appendix S2, is merged with this transcription.† The
"correct" label for these segments is considered to be the hand label which was in effect
at the middle of the acoustic segment.
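
The rule for assigning a reference label can be sketched as a midpoint lookup. The function name and the (start, end, label) tuple layout are assumptions made for illustration; times may be in any consistent unit.

```python
def reference_label(segment, hand_labels):
    """segment: (start, end) of an acoustic segment.
    hand_labels: list of (start, end, label) from the hand transcription.
    Returns the hand label in effect at the segment's midpoint, or None."""
    mid = 0.5 * (segment[0] + segment[1])
    for start, end, label in hand_labels:
        if start <= mid < end:
            return label
    return None
```

A fine acoustic segment that straddles a hand boundary thus takes the label on whichever side its midpoint falls.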

Labeling experiments are, therefore, performed over the set of segments (and thus
samples in the pattern space) which we would most like our segmentation routines to
provide in the actual system. We will additionally show some results of labeling the
segments provided by the machine segmenter, although we will not evaluate these
completely. It is hoped that the construction of a labeling reference which combines our
expectations for acoustic segments and phonetic labels will most accurately reflect the
expectations or needs of a speech understanding system at the higher levels of analysis.

† If such a correct segmentation is not available, it is better to use the machine
segmentation of the best parametric representation, with its segmentation errors but
tuned to miss as few segments as possible, than to use a broad hand segmentation and
label over transition portions of the signal.

8.2  Evaluation  Space 

The design space which we are attempting to investigate is quite large. Even after
we have accepted the limitations discussed in Chapter 1 -- the choice of four parametric
representations, ignoring the multiple speaker normalization issue, restricting ourselves to
acoustic pattern space labeling, and the choice of simple distance metrics with static
training procedures -- we are faced with the issues of recognition target set size, error
criterion, and segmentation just discussed. Instead of spending any more time justifying
the decisions we have made -- it is probably sufficient to have raised the issues -- we will
outline the dimensions to be covered and the methods of presentation of results.

8.2.1  Experimental  Dimensions 

The four parametric representations will be used for a set of 40 sentences from the
News Retrieval task (also used for the segmentation evaluation). In addition, a second set
of News Retrieval sentences, spoken by another male American speaker, and a third set,
used in the development and testing of the Dragon system [BakJK75b], will be evaluated.

With the first set, as many of the basic distance metrics -- EUC, COR, SIG, and LIK --
as are applicable will be used over the SPG, ASA, and ZCC parameters. The modified ZCC
parameters used by Dragon will also provide a point of reference for one total system's
performance. The ACS representation is to be used with Itakura's specially designed log
ratio measure only, as it is poorly suited for the more standard distance functions. (See
Ichikawa [Ich73] for the poor performance of Partial Correlation Coefficients.)

The purpose of the additional sets of utterances is, of course, to justify our
assertion that the training and labeling process as a whole is valid for more than one
speaker and set of utterances. However, the additional data can only help to improve the
fidelity of our performance results.

Finally, for each experiment, we will present a family of results for a sequence of
phone  class  sets  (see  Figure  8.1).  Each  set  provides  a partitioning  of  the  phonetic  space 
at  a different  degree  of  fineness,  and  may  be  relevant  to  particular  aspects  of  the  total 
speech  understanding  problem. 

8.2.2  Methods  for  Presenting  Labeling  Accuracy 

Schwartz and Makhoul [SchR75] point out that the appropriate response to the
problem of ambiguity in continuous speech is ambiguity in the segmentation and labeling
output. That is, optional interpretations of the input signal should be put forth as possible
alternative recognitions rather than one and only one label for each segment and one
stream of segments.† If the labeler finds a number of plausible labels for a single speech
sample, it can do no better than to rate and return them all. Thus, it is not sufficient to
evaluate labeling accuracy without indicating the criteria for accepting a label as such a
plausible interpretation of an input pattern. Indeed, returning a number of labels may
merely be considered to represent a finer partitioning of the pattern space. There is a
duality, which is similar to the feature vs. template issue, between returning alternative
labels and using more and finer ones.

† While we do not provide a lattice structure of segments and labels as in the Speechlis
system, neither do we claim to make use of phonetic rules or sub-phonemic segment
sequences. The acoustic segments are detected with a strong bias towards the extra
segment end of the ROC curve, and the next level in Hearsay II, for example, is able to
combine many of them in an optional manner within the flexible data base structure of
Hearsay II. Within this optional segment structure, sets of alternative labels are
re-combined according to a phonetic feature calculus to produce a new set of most likely
labels for the combined segments.

Figure 8.1: Phone Class Sets

In lieu of any more global information, the labeler can only use the pattern space
distance value with which it orders the templates to rate them as well, and to select the
acceptable ones.* If we define a selection criterion (e.g., d(x,y) <= min_y d(x,y) + T), then
we can collect the statistic: Pr{correct label passes criterion}. In order to understand the
effect of the acceptance of multiple labels on system performance, size of search, etc., we
must also collect the statistic: E[number of recognition objects which pass criterion] for
every segment labeled. This latter may be considered as a measure of the factor of
growth a search of the possible optional machine transcriptions displays at each new
segment -- the branching factor (BF). These statistics can be presented in graphic form as
a relationship between BF and Accuracy. It ought to be noted that the recognition objects
mentioned above are whatever labels or classes of labels we choose to evaluate
performance over.
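
The two statistics can be sketched as follows for one value of T. The function name and the data layout (one dictionary of label distances per segment) are assumptions made for illustration; sweeping T traces out the BF versus Accuracy relationship.

```python
def accuracy_and_bf(segments, t):
    """segments: list of (correct_label, {label: distance}).
    Returns (Pr{correct label passes criterion}, mean number of labels passing)."""
    hits = 0
    passed = 0
    for correct, dists in segments:
        limit = min(dists.values()) + t  # criterion: d <= min_y d + T
        accepted = [label for label, d in dists.items() if d <= limit]
        hits += correct in accepted
        passed += len(accepted)
    return hits / len(segments), passed / len(segments)
```

With T set to zero only the closest label passes (barring exact ties), so BF is 1 and the accuracy is the first-choice accuracy; raising T increases both quantities.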

In case we are interested in the kind of errors made, a confusion matrix provides
the Pr{recognize y | input x}, where x and y are both members of the particular set of
labels or classes of labels under consideration. However, there is some difficulty in
presenting multiple choices in this format without producing a great deal of extraneous
information. The confusion matrix does, however, give an idea of the error behavior of the
labeler -- the quality of the mistakes as well as the quantity. Combined with the
measurement of correctness provided by the BF/accuracy graphs, this should be a broad
enough picture of labeler performance and yet provide enough detailed information for
special cases of interest. The BF values may be used to estimate system resource
demands; the confusion matrix conditional probabilities to estimate higher level confusions
(word confusions based upon incorrect phonetic information, etc.).
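
A confusion matrix of this kind can be accumulated from first-choice results as sketched below. The function name and the nested-dictionary layout are illustrative assumptions.

```python
from collections import Counter, defaultdict

def confusion_matrix(pairs):
    """pairs: iterable of (input_label, recognized_label).
    Returns {x: {y: Pr{recognize y | input x}}}."""
    counts = defaultdict(Counter)
    for x, y in pairs:
        counts[x][y] += 1
    return {x: {y: n / sum(row.values()) for y, n in row.items()}
            for x, row in counts.items()}
```

Each row is normalized by the number of times its input label occurred, so the entries of a row sum to one.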

We do, in fact, use certain global information in the form of prosodic features mentioned
in Chapter 7. The effect of these features is, however, integrated into the pattern
distance value as soon as the templates are matched with the unknown input pattern.

A final measure of labeling proficiency can be derived from the signal detection theory
referred to in Chapter 6. This measure can be used to normalize for recognition target set
size. A detailed discussion may be found in Green, et al. [Gre64], but, basically, the
accuracy, Pr{correct target chosen out of N possibilities}, is related to the same d’
measure presented earlier -- the difference between the signal and noise distributions in
some signal measure space. The d’ values given below are computed from tables provided in
[Gre64] and based upon a theoretical model which assumes all the N targets to be orthogonal
in the pattern space. Since this is not true, because of our need for phonetic templates
which may duplicate one another to some degree, the accuracy rates are consequently lower,
and the d’ values lower, than predicted by the model for the actual signal to noise ratio.
They can, however, serve as an interesting normalized measure for comparison of labeling
performance, just as they do for segmentation performance.
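
The relation between accuracy, d’, and N can be sketched numerically. The Python below assumes the standard equal-variance Gaussian model for choosing among N orthogonal alternatives, in which Pr{correct} is the integral of phi(x - d’) * Phi(x)^(N-1) over x; we take this to be the model behind the tables, but the function names and the integration settings are assumptions of this example, not the procedure used to obtain the reported values.

```python
from statistics import NormalDist

_STD = NormalDist()  # standard normal: pdf is phi, cdf is Phi


def prob_correct(n_targets, d_prime, step=0.01, span=8.0):
    """Pr{correct target chosen out of N} for N orthogonal, equal-variance targets.

    Integrates phi(x - d') * Phi(x)**(N - 1) over x with the trapezoid rule.
    """
    lo, hi = -span, span + d_prime
    steps = int(round((hi - lo) / step))
    total = 0.0
    for k in range(steps + 1):
        x = lo + k * step
        weight = 0.5 if k in (0, steps) else 1.0
        total += weight * _STD.pdf(x - d_prime) * _STD.cdf(x) ** (n_targets - 1)
    return total * step


def d_prime_for(accuracy, n_targets, lo=0.0, hi=10.0):
    """Invert prob_correct by bisection; accuracy must exceed chance (1/N)."""
    for _ in range(40):
        mid = (lo + hi) / 2
        if prob_correct(n_targets, mid) < accuracy:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


print(prob_correct(40, 0.0))   # chance level for 40 targets
```

With d’ = 0 the model reduces to chance, 1/N, which is why the same accuracy corresponds to a larger d’ as the target set grows.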

8.3  Results  of  Labeling  — One  Speaker 

There are so many dimensions to even a simple labeling experiment that we will explore the
space along each one, individually, rather than try to cover the entire set of possible
labelers. This section presents the results of labeling experiments performed on the 1416
segments which comprise the 40 news retrieval sentences dealt with in Chapter 6. The
dimensions of interest are: Parametric Representation, Distance Metric, Acceptance
Criterion (Branching Factor), and Target Set Size. In addition, confusion matrices and hand
analysis are presented to provide a qualitative picture of some typical labeling
performance.

Figure 8.2 shows overall labeling accuracies for the four parametric representations.

Figure 8.2: Labeling Performance -- Different Parametric Representations

The distance metric is the Euclidean distance function†, and the full set of 40
recognition targets is evaluated. The values reported for position p are Pr{correct
template in position ≤ p}. The expected number of different targets (branching factor) is
given in parentheses. In these results, concerned with speaker CC, the clustering
algorithm of Chapter 7 provided 63 templates for SPG data, 76 for ASA, 75 for ZCC, and
87 for ACS. All are acoustic templates for the 40 phonetic targets.

Note that, although ASA is a bit better than SPG for p = 1, SPG improves faster with
increasing position (BF). This may be due to one or two bad templates for SPG which
"capture" first place often enough to affect that statistic. More careful tuning of the
template sets, while desirable, could not be done for all experiments. The ZCC
representation is the only clearly inferior one, much as might be expected from its few,
broad filters. Yet its performance is not too much worse than the others.

Figure 8.3 shows the representation fixed at ASA, and presents four distance metrics.
Again, the accuracies are given for the first three positions. The same template set was
used throughout.


P    EUC          COR          SIG          LIK

1    27.1(1.0)    25.0(1.0)    28.7(1.0)    25.1(1.0)

2    39.1(1.9)    37.0(1.9)    41.3(1.9)    35.5(1.8)

3    50.4(2.8)    49.3(2.8)    50.8(2.8)    44.1(2.3)


Figure 8.3: Labeling Performance -- Distance Metrics

Almost identical performance is obtained from EUC, COR, and SIG. The LIK metric makes use
of more information (the covariances within each target training set), yet is unstable for
some targets. This is due to instability of the covariance matrix inversion caused by
insufficient samples in the training set. It should be noted, however, that the BF values
for LIK are lower at each position, indicating a greater likelihood of multiply-recognized
targets. For the same BF values, LIK performs comparably.

† The ACS representation was run only with Itakura’s log ratio distance.

In both of the above dimensions, there is very little difference among the choices
examined. We will discuss this apparent lack of preference in the last section. It is a
rather strong result of this work.

Figure 8.4 is a graphic display of accuracy versus Branching Factor for the SPG/SIG
experiment.

Figure 8.4: Branching Factor vs. Accuracy


Five  plots  are  given,  identified  by  the  size  of  the  target  set  used  in  each  evaluation. 
The  BF  plot  gives  a particularly  convenient  view  of  accuracy  against  the  demands  that  will 
be  made  upon  higher  levels  by  excess  options  in  recognition. 

A normalization may be made for target set size by the signal detection model discussed
earlier in this chapter. The figures for BF = 1.0 from the SPG/SIG experiment are given in
Figure 8.5 with their theoretical detectability, d’†.

Figure 8.5: Effect of Size of Target Set

The confusion matrices in Figure 8.6 were obtained from the SPG/SIG experiment.
The entries are normalized by row to estimate, for entry ij, Pr{target j in position 1 |
hand label i}. Clearly, for rarely occurring hand labels, this estimate will be less
accurate. Important types of confusion can be seen, however.

The last display (Figure 8.7) is a trace of a representative utterance. The entries
are: segment times (in centiseconds), hand label, rank and score (distance) of "correct"
template, and -- in order of score -- template, score, and prosodic weight.

Appendix L1 contains the BF/accuracy tables and the confusion matrices for each of
the experiments run. Appendix L2 is a more extensive machine transcription of the CC
data corpus. This displays the segmentation and labeling, both machine and hand
produced.

8.4  Results  of  Labeling  — Other  Speakers  and  Vocabularies 

In the last section, we extensively explored the evaluation space for one speaker
and task (vocabulary). In order to extend the validity of our results across both the
speaker and vocabulary dimensions, we have run a limited set of additional labeling
experiments.

These include a second speaker (LE), again for the news retrieval task and vocabulary, and
a third speaker (JB) with partly different task influences. The JB sentences were run by
the Dragon system and were all recognized correctly using the BAK distance metric -- a
modified Euclidean distance. Most important to that performance was the carefully tuned
word lexicon, which provided a great deal of phonological disambiguation.

† Calculated by approximation; the procedure may be less accurate for small size (see
[Gre64]).

Figure 8.7: Trace of Labeling Evaluation

Figure 8.8 gives the results for recognition of the full phone set, in the first three
positions, for these data sets, as well as the ZCC/EUC results reported above for speaker
CC.

Figure 8.8: Labeling Performance -- Other Speakers and Tasks


There are a few observations to be made. First, the performance of the LE data is
almost identical to that of the CC data. This is in spite of the fact that many more (120)
templates were found for the LE training†. However, the conditions under which the
recordings were made and the hand segmentation and labeling performed were the same for
both CC and LE data. Thus, the level of representation of the expected labels was very
similar for both data sets.

In the case of the JB data, the hand referents were actually generated by a modified form
of the Dragon system. This form sought to fit the correct sentence to the signal (using the
ZCC/BAK decision metric) by applying all the knowledge that Dragon would have at its
disposal for recognition. Then obvious errors were hand corrected. Clearly, this
expectation is closer to the ZCC/BAK results. However, it is also closer to the other
acoustic, machine results tested here than a totally hand produced transcription would be.
The fact that the level of representation of the evaluation referent, and not the decision
metric and parametric representation, is the cause of this increased accuracy is shown by
the last column. The same data run on ASA/EUC, neither used by the machine transcription
form of Dragon, produced the same high performance results as the ZCC/EUC experiment.

† A totally arbitrary variation, due to a different cluster rejection criterion.

8.5  Discussion

A short discussion of the preceding labeling evaluations is in order. The lack of
significant differences among a large number of the experiments might seem rather
counterintuitive in the light of the differences among the parametric representations in
segmenting. However, the tasks of segmentation and labeling are quite different, and we
believe this lack of comparative difference to be an important result. In addition, the
labeling may seem to be performing at a rather low level in comparison to other systems
such as were discussed in Chapter 4. We claim that this level of performance is, in fact,
reasonably good performance for the current state of the art. It merely needs to be
evaluated at a lower level of representation than has been done.

In a preliminary study of labeling accuracy, we found considerable differences
among both representations and metrics. The experimental set-up was, however, quite
different. One template per target sound was acquired by averaging a number of training
samples. Testing was then performed over the same data set. In such a situation, second
moment data greatly aided the SIG and LIK metrics in identifying the training populations
correctly. In addition, averaging had a more disastrous effect upon some parametric
representations than others.

In the current experiments a significant amount of knowledge about the distribution
of patterns for speech sounds is contained in the multiple templates for each target sound.
Thus, the theoretical shape of the decision boundaries is of much less significance in
affecting error rates. Since we do not test and train on the same set, performance is
poorer (more realistic), and the specific distribution of the training samples is less
significant.


Errors of a higher level are probably responsible for the performance shown in the
last section. Coarticulation, effects of phonological and phonetic variations, prosodic
effects, and other sources of confusion are all beyond the capacity of a simple template
matching routine. It is significant, in fact, that the labeling performances are so much
alike. This similarity indicates that most of the action available to acoustic level
labeling is being achieved.


A second issue is the low accuracy reported in the last section. The explanation is
clearly the lack of any higher level knowledge in our labeling routine. But to support our
claim to reasonable performance of this simplified labeler, we can point out the following.
First, the Itakura log ratio metric has been tested in a word recognizer by Itakura [Ita75]
and has yielded excellent results for limited speech recognition tasks. The same
parametric representations and metric yield less than 30% accuracy at the acoustic level,
and close to 98% at the word level.


A second point is Baker’s Dragon System [BakJK75b]. The parametric representation
and distance metric used are essentially ZCC/EUC (with some amplitude normalization).
This classification function was tested and gave results comparable to the results reported
above. Templates generated by our clustering routine were substituted for the standard
phonetic spellings used by Baker in his initial development. In addition, phonological rules
were applied to the lexical entries for mistaken words to produce alternative template
sequences for those words. However, none of this tuning was performed on the test data.
Dragon was run on a set of 578 words in 102 sentences from five tasks, one speaker, with
a dictionary of 354 words. The word level accuracy reported was greater than 99%.
Phonological and syntactic knowledge sources were sufficient to correct all the errors of a
30% accurate labeler.

When the Dragon system knowledge sources are used (in a modified form of the
system) to generate a transcription of the known utterance, they produce a referent
transcription more suited to evaluating acoustic labeling. Comparison with this referent
yields accuracies of 40% and greater for the least capable parameters, ZCC.

Finally, we may refer to Shockey and Reddy's foreign language transcription
experiment [Sho74a] and to Klatt and Stevens’ spectrogram reading results [Kla72]. Our
results are quite comparable with trained phoneticians reading spectrograms or waveforms
in the absence of any higher level linguistic support. They are not much worse than
human performance with auditory input.

In this concluding chapter, we would like to draw together some of the previously
discussed results and methods into a more coherent view of speech recognition activities
at the parametric level. At the same time, we should also restate the major contributions
of this work and point out particular areas where future effort is warranted.

We will first offer a brief summary of the entire thesis, in order to draw together
some of the major points and restate the primary results. We will then list the
contributions with short discussions, and finally proceed to discuss the parametric level of
speech understanding systems in the light of this work. The last section will be devoted
to possible areas for further research.

9.1  Summary  of  the  Thesis 

This thesis is a study of machine speech recognition at the parametric level. It
attempts to evaluate and understand the relative merits of a number of alternative design
choices at that level. Such a study raises issues in Artificial Intelligence, Linguistics,
Acoustics, Pattern Recognition, Statistics, and Speech Understanding research. In
particular, it involves an investigation of segmentation and labeling techniques, and the
use of parametric representations for the acoustic signal in those techniques. Every speech
recognition system employs some parametric representation and some initial
signal-to-symbol transformation. We show the performance currently available for those
initial processes, and assert that such performance is comparable to human performance. We
present the relative merits of some typical parametric representations, and develop a
methodology for such comparative evaluation. Simple, parameter-independent schemes for
segmenting, labeling, and training are developed as well. The role of pattern classification
techniques, as they relate to the initial signal-to-symbol transformation, is clarified.

9.1.1  Background

Although most of our knowledge about how to recognize and understand speech is
taken from human performance, the structure of computer speech recognition and
understanding systems is of particular importance to this study. Knowledge about speech
is generally organized into separate sources of knowledge; each works with a
representation of the information content of the input utterance. These representations
may exist at a number of different levels, as suggested by their elements: speech sounds,
phonetic gestures, phonemes, syllables, words, syntactic units, concepts, etc. In evaluating
the performance of recognition processes at the parametric representation level, we
eliminate, as much as possible, the effects of ambiguities from other levels. Such
ambiguities as coarticulation or phonological variation will strongly affect the degree to
which the expected transcription of an utterance corresponds with the acoustic
performance. A great difficulty in comparing published results is that the level of the
knowledge used in recognition and the representation used for evaluation are not usually
specified. Usually, only total system performance may be compared, not the effectiveness
of component methods.

9.1.2  Parametric  Representations 

Parametric representations fall into a few major types; typical examples of each
have been chosen for study. A bank of broad-band filters (ZCC) with amplitude and
zero-crossing measurements, and a bank of narrow-band filters (ASA), amplitude only,
represent analog methods. A digital Fourier transform of the LPC filter [Mar72] produces a
smoothed spectral envelope (SPG) very much in current use. Finally, the autocorrelation
sequence (ACS) is employed with a special method designed for it [Ita75]. Each method
yields a set of measurements at uniform, short intervals -- a pattern.

9.1.3  Distance Metrics

Distance functions, chosen from Pattern Classification theory, are then applied to the
parameter patterns as measures of acoustic similarity. The basic model adopted is that of
a vector of parametric measurements for each pattern. These vectors define a space of
possible patterns; within this space a measure of distance may be applied between
patterns. As populations of sample patterns are accumulated, better statistical
descriptions may be estimated of the true distribution of those patterns in the space. A
simple example might be to collect all the occurrences of a phone and compute the mean
and variance of each dimension. Then a suitable measure of similarity might be Euclidean
distance, weighted by variance, to approximate a measure of the deviation from population
mean. This is one distance metric chosen (SIG). The others are Euclidean distance (EUC),
Correlation (COR), which is the magnitude normalized dot product, and Maximum Likelihood
(LIK). In this last, the population covariance matrix is used to calculate Pr{unknown
produced from population}, under the assumption of Gaussian distributions.
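
A minimal Python sketch of three of these measures follows. The function names are hypothetical, turning the correlation into a distance as one minus the normalized dot product is an assumption of the example, and LIK is omitted because it requires inverting a full covariance matrix.

```python
import math


def euc(x, t):
    """Euclidean distance between pattern x and template t."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, t)))


def cor(x, t):
    """Correlation distance: one minus the magnitude-normalized dot product."""
    dot = sum(a * b for a, b in zip(x, t))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in t))
    return 1.0 - dot / norm


def sig(x, mean, var):
    """Euclidean distance from the mean, each dimension weighted by its variance."""
    return math.sqrt(sum((a - m) ** 2 / v for a, m, v in zip(x, mean, var)))


def mean_and_variance(samples):
    """Per-dimension mean and population variance of a list of patterns."""
    n = len(samples)
    columns = list(zip(*samples))
    mean = [sum(col) / n for col in columns]
    var = [sum((a - m) ** 2 for a in col) / n for col, m in zip(columns, mean)]
    return mean, var


# Hypothetical two-dimensional training population for one phone.
mean, var = mean_and_variance([[1.0, 2.0], [3.0, 6.0]])
print(euc([0.0, 0.0], [3.0, 4.0]), sig([3.0, 8.0], mean, var))
```

A dimension with zero variance would make sig divide by zero, so a real training population needs enough samples per target for every variance to be positive.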

9.1.4  Segmentation 

A method for segmenting speech into isolated, acoustically consistent segments is
presented. The method is fairly independent of the choice of parametric representation,
since it relies upon the acoustic similarity measure as the primary evidence of acoustic
change. First, however, a threshold is applied to the signal amplitude measurement to
discriminate between speech and silence. Then the speech portion is examined further. In
collecting evidence for a segment boundary, a measure of change is applied to neighboring
parameter patterns. This measure produces a time sequence of values whose peaks are
detected and subjected to a threshold for acceptance or rejection. A composite of such
functions yields the final segmentation. Narrow and broad pattern similarity and amplitude
change are the three functions applied to the non-silence portions of the signal. This
process is very much like the process hypothesized in the basic model for Signal Detection
[Ega64]. That model may be applied to the problem of evaluating segment boundary
"detectability."

Missing and extra segment error rates are found to be as low as 4% and 19%,
respectively. Significant differences in the segmentation effectiveness of the parametric
representations are found. They may be ordered as follows: SPG, ACS, ASA, and ZCC. The
best performance is found to be comparable to the state of the art. Little reduction in
accuracy is encountered when new speakers are tested.
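
The peak-picking step of the segmentation method can be sketched in Python as follows. Only one change function (the distance between adjacent patterns) is shown; the composite of narrow, broad, and amplitude functions is not reproduced, and the function names, argument names, and threshold values are assumptions of this example.

```python
def change_measure(patterns, distance):
    """Distance between each pair of neighboring parameter patterns."""
    return [distance(a, b) for a, b in zip(patterns, patterns[1:])]


def boundaries(patterns, amplitudes, distance, silence_level, peak_level):
    """Return each frame index i with a boundary between frames i-1 and i."""
    change = change_measure(patterns, distance)
    found = []
    for k in range(1, len(change) - 1):
        # change[k] compares frames k and k+1; both must be speech
        is_speech = amplitudes[k] > silence_level and amplitudes[k + 1] > silence_level
        is_peak = change[k - 1] < change[k] >= change[k + 1]
        if is_speech and is_peak and change[k] > peak_level:
            found.append(k + 1)
    return found


# Hypothetical one-dimensional patterns with a single abrupt change.
frames = [[0.0], [0.0], [0.0], [5.0], [5.0], [5.0], [5.0]]
loud = [1.0] * len(frames)
absdiff = lambda a, b: abs(a[0] - b[0])
print(boundaries(frames, loud, absdiff, silence_level=0.5, peak_level=1.0))
```

Because the procedure sees the parameters only through the distance function, the same code applies unchanged to any representation for which a distance is defined.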

Figure 9.1 shows the results of segmentation for 40 sentences from the News
Retrieval task, one speaker.

Figure 9.1: Segmentation -- Different Parametric Representations


The reference segmentation contains 1082 segments, primarily at the phonemic level
of description. The second reference contains corrections to this file (1541 segments), to
make it more an acoustic description of the corpus. The number of machine segments
reported may be greater than the sum of this second size (hand reported acoustic
segments) and the number of extra boundaries. The discrepancy is an artifact of the way
we evaluate segmentation. Occasionally, two machine boundaries will fall close enough to a
hand boundary so that both are accepted. Such segments must, therefore, be very short,
and are usually transition segments which may easily be deleted at higher levels. The
number of missing boundaries (segments), divided by the number of boundaries which are
included in both reference segmentations, is the missing segment error rate. The number
of shifted boundaries is also divided by this number. The number of extra boundaries is
divided by the number of primary segments (the size of H1 in this case). The extra
segment rates in parentheses are those where division is by the number of acoustic
segments (size of H2). The value, d’, is a single measure of detectability from the Signal
Detection model. It has the effect of normalizing for the trade-off between missing and
extra segment errors.
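
The rate definitions above amount to the following arithmetic, sketched in Python. The reference sizes 1082 (H1) and 1541 (H2) are taken from the text; the boundary counts passed in are hypothetical placeholders chosen only to exercise the function, not results of this work.

```python
def segmentation_error_rates(missing, shifted, extra, common, h1_size, h2_size):
    """Error rates as defined in the text.

    common  -- number of boundaries included in both reference segmentations
    h1_size -- number of primary (phonemic) segments, reference H1
    h2_size -- number of acoustic segments, reference H2
    """
    return {
        "missing": missing / common,
        "shifted": shifted / common,
        "extra": extra / h1_size,
        "extra_acoustic": extra / h2_size,   # the rate shown in parentheses
    }


# Hypothetical counts; only the two reference sizes come from the text.
rates = segmentation_error_rates(
    missing=40, shifted=25, extra=200, common=1000, h1_size=1082, h2_size=1541
)
print(rates)
```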

9.1.5  Labeling 

Labeling is accomplished by simple pattern distance metrics. Given a set of phonetic
elements as the recognition targets, a set of templates for each target is derived from the
training data. This is achieved by a clustering algorithm developed for the purpose of
encoding into the set of templates some of the ambiguities encountered because of
allophonic variation. The pairwise distances are computed for all pairs of sample patterns
in the training population for a particular phonetic target. Then a threshold is chosen from
these values, and the distances below threshold are marked. The sample pattern in the
most marked pairs is chosen as a representative template and all its marked mates are
discarded. After iterating, the population is divided into clusters of various sizes, each
with a "best" representative template pattern. Clusters of sufficiently small size are
ignored.
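
The clustering steps just described can be sketched in Python as follows. How the threshold is chosen from the pairwise distances is not reproduced here, so it is passed in as a parameter; the function name, the min_size parameter, and the tie-breaking rule (lowest sample index first) are assumptions of this example.

```python
def cluster_templates(samples, distance, threshold, min_size=2):
    """Greedy clustering: return a list of (template, cluster_size) pairs."""
    remaining = set(range(len(samples)))
    # mates[i]: samples whose distance to sample i falls below the threshold
    mates = {
        i: {j for j in remaining
            if j != i and distance(samples[i], samples[j]) < threshold}
        for i in remaining
    }
    templates = []
    while remaining:
        # the sample in the most marked pairs represents the next cluster
        best = max(sorted(remaining), key=lambda i: len(mates[i] & remaining))
        cluster = (mates[best] & remaining) | {best}
        remaining -= cluster              # its marked mates are discarded
        if len(cluster) >= min_size:      # sufficiently small clusters are ignored
            templates.append((samples[best], len(cluster)))
    return templates


# Hypothetical one-dimensional training population for one target.
population = [[0.0], [0.1], [0.2], [5.0], [5.1], [9.0]]
absdiff = lambda a, b: abs(a[0] - b[0])
print(cluster_templates(population, absdiff, threshold=0.25))
```

In this toy population the isolated sample at 9.0 forms a cluster of size one and is dropped, leaving two templates.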

Labeling itself proceeds by computing the distance from the unknown pattern to
each template. In addition to the distance metrics mentioned, three prosodic features of
each segment -- the average amplitude, the duration, and the amplitude contour of the
surrounding segments -- are used to increase the distances to templates whose prosodic
features are considerably different.

The  set  of  templates  (and  their  appropriate  target  labels)  and  the  distance  scores 
give  the  total  recognition  information  available  from  this  straightforward  labeler.  If  some 
criterion  is  placed  on  the  templates  which  one  is  willing  to  report  to  the  rest  of  a system, 
then  accuracy  may  be  measured  as  a function  of  the  severity  or  looseness  of  that 
criterion.  If  the  true  effect,  upon  a speech  recognition  system,  of  loosening  the 
acceptance  criterion  is  to  be  understood,  one  must  also  measure  the  expected  number  of 
separate  targets  reported  at  each  instance.  We  call  this  the  Branching  Factor  (BF),  and 
collect  it  as  well  as  accuracy  statistics  in  evaluating  labeling  performance. 

Little difference is observed along the parametric representation or the
classification metric dimensions, except for poorer performance for ZCC input. Each input
segment is labeled as one of a set of 40 phone labels. The correct phone appears as the
first choice 28% of the time. It appears in the first three choices 55% of the time.
However, when a lower level, acoustic transcription is used as the evaluation referent,
these values increase to 42% and 65%. Even the 28% accuracy, which arises from a
comparison against phonemic expectation, is acceptable performance; it is the same as or
slightly better than human spectrogram reading performance in the absence of other
linguistic clues [Sho74a].

Figure 9.2 shows overall labeling accuracies for the four parametric representations.
The distance metric is the Euclidean distance function†, and a set of 40 phonetic
recognition targets is used. The values reported for position p are Pr{correct template in
position ≤ p}. The expected number of different targets (branching factor) is given in
parentheses.

Figure 9.3 is a graphic display of accuracy versus Branching Factor for the SPG/SIG
experiment. Five plots are given, identified by the size of the target set used in each
evaluation. The BF plot gives a particularly convenient view of accuracy versus the
demands that will be made upon higher levels by excess options in recognition.


† The ACS representation was used only with Itakura’s log ratio distance.

P    SPG          ASA          ZCC          ACS

1    24.6(1.0)    27.1(1.0)    20.3(1.0)    28.7(1.0)

2    42.4(1.9)    39.1(1.9)    31.4(1.9)    44.4(1.9)

3    54.0(2.8)    50.4(2.8)    42.0(2.8)    54.8(2.7)


Figure 9.2: Labeling -- Different Parametric Representations

Figure 9.3: Branching Factor vs. Labeling Accuracy for Various Target Sets


9.2  Contributions 

9.2.1  A Comparison of Parametric Representations

It should be clear that any effort to compare the various parametric representations
in terms of their suitability for use in speech recognition systems is needed, and such
results are of value. What is not clear is how accurately such comparisons have been
made and with what confidence they can be applied to predicting performance and
designing systems. It is our contention that the uniform manner in which we have applied
the tested representations to segmentation and labeling, the simplicity and ubiquitous
nature of the pattern classification assumptions, the quantities of data, and the care with
which we have evaluated the results reported in this dissertation all contribute to the
fidelity of those results and to our confidence in, at least, the relative strengths of the
representations reported.

In the segmentation process, the choice of parametric representation shows
significant effect. The best results obtained -- for SPG, missed rate = 4%, extra rate = 19%
-- may be compared to the next best, ACS, by normalizing with d’ for one of the error
rates. ACS for the same missed rate as SPG would produce 33% extra segments.

Labeling performance is not affected by choice of parametric representation among
the three: SPG, ACS, and ASA. Nor does the choice of distance metric for the labeling
algorithm have any effect. Although the top labeling choice is correct only about 25% of
the time, in over 50% of the segments labeled, the correct choice is among the top three
choices. This behavior agrees well with human performance at spectrogram and waveform
reading under similar constraints. If we believe that the human spectrogram reading was
very competent, then while the representations may not afford all the information about
speech available to the human ear, the labeler is performing about as well as may be
expected with those representations as input.

9.2.2  Parameter-Independent  Segmentation 

The segmentation algorithm described in Chapter 5 may be easily adapted to any
set of parameters and any measure of similarity in the space of those parameter vectors.
The algorithm need not be trained for every speaker, and works well for a variety of
parametric representations. Our experience with it has indicated very little degradation of
performance across speakers, with recording conditions constant. This is probably due to
the fact that boundary detection depends upon detecting changes in the pattern space.
Large, sudden changes will be detected by any reasonable scheme for segmentation. The
small or slow changes are the most difficult to detect correctly. Yet these small shifts are
not affected by large, smooth transformations to the overall pattern space which may
characterize speaker change. (E.g., different formant locations for a new speaker will not
seriously affect the detection of formant shifts.) Labeling, on the other hand, depends upon
gross comparisons of patterns to a much greater extent; it is more dependent upon the
absolute locations in the space of patterns for particular phone templates. The segmenter
represents an available tool and a benchmark for acoustic level segmenting whose overall
performance is comparable to other current programs. Moreover, the method of threshold
acquisition is easily adapted to more dynamically sensitive techniques, as will be mentioned
below.
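
The argument about speaker change can be illustrated with a minimal sketch. This is not
the Chapter 5 algorithm: it assumes a single Euclidean distance between successive
parameter vectors and a fixed threshold, and the frame values are invented. It shows only
that a uniform shift of the whole pattern space, the simplest kind of smooth
transformation, leaves the detected boundaries unchanged.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two parameter vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def detect_boundaries(frames, distance=euclidean, threshold=1.0):
    """Indices i where frame i differs from frame i-1 by more than threshold."""
    return [i for i in range(1, len(frames))
            if distance(frames[i - 1], frames[i]) > threshold]


# Invented two-dimensional parameter vectors at uniform time intervals.
frames = [[0.0, 0.0], [0.1, 0.0], [3.0, 2.0], [3.1, 2.1], [0.2, 0.1]]
print(detect_boundaries(frames))  # [2, 4]

# A hypothetical "new speaker": every vector shifted by the same offset.
shifted = [[x + 5.0, y - 2.0] for x, y in frames]
print(detect_boundaries(shifted) == detect_boundaries(frames))  # True
```

Any distance function over the parameter vectors may be passed in place of `euclidean`,
which is the sense in which such a scheme is parameter-independent.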

9.2.3  The  Role  of  Primitive  Pattern  Classification  Methods 

In the past, many of the methods and results available from statistical pattern
classification research have been dismissed or tacitly assumed of small value without
sufficient attempt to understand the implications of the surrounding issues (training, target
sets, metrics, etc.) which strongly affect performance. It is hoped that this work will
stimulate further, careful application of the methods involving stochastic pattern spaces. It
is apparent, however, that much of the disenchantment with these techniques stems from
their failure to solve the recognition problem at a level accessible to higher level
knowledge sources. Pure statistical classification approaches will not, in our opinion,
provide such a solution, as our low initial labeling accuracies indicate -- 25% may sound
low to the uninitiated. The analysis of the proper roles for pattern classification methods
which we have presented, in the context of testing their usefulness, may also serve to
define fruitful avenues for applying more sophisticated pattern classification techniques. A
great deal has been written about the acoustic level of speech, but fairly little attention
has been paid to techniques which are specifically and specially suited to computer
implementation. We feel that pattern classification methods are so suited.

One interesting result of our work, with direct implications for pattern classification
techniques and speech, is the irrelevance of second moment statistics for describing the
template distributions -- the distance metric dimension. For example, ASA parameters were
labeled with the four metrics: EUC, COR, SIG, and UK, the last two utilizing variance and
covariance statistics, respectively. When accuracy results are calculated for a fixed
branching factor of 3.0, all accuracies are within 1% of 50%. Either the clusters are fine
enough divisions of the pattern space to capture all the relevant information about the
target population distribution without the need to involve second moment data, or,
alternatively, the distributions are spherical (i.e. the parameters are uncorrelated). It is
hard to imagine adjacent narrow filter bands in the range of frequencies in question to be
uncorrelated, so it is likely that the former argument is more valid. This implies that the
emphasis for speech applications of pattern classification techniques should be in
clustering, tracking, or dynamic training -- to capture, empirically, the complexities caused
by stress, coarticulative, and other phonetic variations.
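
The spherical-distribution alternative can be illustrated with a small sketch. The two
distance functions below are an ordinary Euclidean distance and a variance-weighted
variant; they stand in for, and are not claimed to be, the metrics actually used in this
work. The template names, means, and variances are invented.

```python
import math


def euclidean(x, mean):
    """Distance from x to a template mean, ignoring second moment statistics."""
    return math.sqrt(sum((a - m) ** 2 for a, m in zip(x, mean)))


def variance_weighted(x, mean, var):
    """Euclidean distance with each squared difference divided by its variance."""
    return math.sqrt(sum((a - m) ** 2 / v for a, m, v in zip(x, mean, var)))


def rank_templates(x, templates, use_variance=False):
    """Template names ordered from closest to farthest from x."""
    def score(name):
        mean, var = templates[name]
        return variance_weighted(x, mean, var) if use_variance else euclidean(x, mean)
    return sorted(templates, key=score)


# Hypothetical templates: (mean vector, per-dimension variance), all variances equal.
templates = {
    "A": ([0.0, 0.0], [1.0, 1.0]),
    "B": ([4.0, 0.0], [1.0, 1.0]),
    "C": ([0.0, 4.0], [1.0, 1.0]),
}
x = [1.0, 0.5]
print(rank_templates(x, templates))                     # ['A', 'B', 'C']
print(rank_templates(x, templates, use_variance=True))  # ['A', 'B', 'C']
```

When every template has the same variance in every dimension, the variance statistics
cannot change the ranking. The sketch does not model the other alternative, in which
clusters are fine enough that the means alone carry the relevant information.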

9.2.4  Methodology  for  Evaluation 

Closely related to our view of the role of pattern classification is the methodology
we have adopted for evaluating performance in those pattern spaces. We have essentially
found that, in order to evaluate accuracy fairly, one must have a fair representation of
what is expected or correct. As an example, labeling accuracy over a set of 40 phones
increases from 20% to 40% when the labeling referent is supplied by a machine aided
process rather than purely hand transcription. The machine aided process finds the best fit
to the acoustic input of label templates, constrained by the stored phonological variations
in a word lexicon. Although this method yields higher accuracy measurements, we have
not used it because it is open to question. The referent is being generated primarily by
the same process which is to be tested. However, we believe that the machine aided
referents generated are as valid descriptions as are the hand transcriptions. Much of our
difficulty has been in acquiring hand segmentations which represent a fair expectation for
acoustic level performance.

The methods and attitudes presented are applicable to any parametric
representation, to any segmentation and labeling output, and to a number of levels of
description of speech. Where knowledge in addition to acoustic/phonetic knowledge is
used, the results will be more like higher level representations and may, therefore, yield
higher accuracies when compared to hand transcriptions. For the IBM labeling and
segmentation program, reported accuracy at the phonemic level is 62%. The program uses
digital spectral parameters and a pattern matching scheme very similar to our own as input
to a detailed set of phonetic and phonological rules. Such performance results are valid
measurements to be applied to the analysis of total system performance, even though they
may imply less about particular aspects (such as a particular parametric representation)
than our primitive level evaluations.

9.2.5  Signal  Detection  Model 

Applying the model of signal versus noise, and the d' measure of signal detectability
[Tan64], may prove quite useful for modeling errors of raw segmentation and labeling
output over a wide range of performance trade-offs. Notably, extra versus missing
segment errors, and recognition set size, can be normalized by d', and whole ranges of
performance predicted. While the validity of this model has not been completely
established, our preliminary success with it, added to the large amount of human
perception research supporting it as a model of detection, seems to lend it credence. We
also see possible applications in predicting performance of systems under simulated
errorful inputs, prior to implementing the actual knowledge sources.
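
The kind of normalization meant here can be sketched as follows, assuming the usual
equal-variance Gaussian form of the signal detection model, in which d' is the difference
of the normal deviates of the hit rate and the false alarm rate. Treating a detected
boundary as a hit and an extra segment as a false alarm is an assumption of this sketch,
and the rates used are invented rather than taken from our results.

```python
from statistics import NormalDist

_STD = NormalDist()  # standard normal: mean 0, standard deviation 1


def d_prime(hit_rate, false_alarm_rate):
    """Detectability index d' = z(hit rate) - z(false alarm rate)."""
    return _STD.inv_cdf(hit_rate) - _STD.inv_cdf(false_alarm_rate)


def false_alarm_at(hit_rate, dp):
    """False alarm rate predicted at a given hit rate for a detector of index dp."""
    return _STD.cdf(_STD.inv_cdf(hit_rate) - dp)


# Invented operating point: 90% of boundaries found, 20% false alarms.
dp = d_prime(0.90, 0.20)
print(round(dp, 2))                        # 2.12
print(round(false_alarm_at(0.90, dp), 2))  # 0.2
# Demanding fewer misses from the same detector costs more false alarms.
print(false_alarm_at(0.95, dp) > 0.20)     # True
```

Two systems measured at different operating points can then be compared by computing
d' for each, or by predicting the false alarm rate of one at the hit rate of the other.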

9.2.6  Clustering 

Our success with the acoustic/phonetic clustering algorithm gives us hope of even
further gains to be made in this direction. By using multiple templates for various acoustic
manifestations of phones, we are able to describe complex partitionings of the pattern
space with simple metrics. In addition, we are able to factor out effects which alter the
acoustic, but not phonetic, nature of the signal. Dynamic methods for tracking clusters may
be applicable here. At minimum, this routine provides a method for integrating two levels
of representation, acoustic and phonetic, which are often difficult to correlate. Once again,
the routine is applicable to any <pattern space, distance metric> definition of similarity of
two speech samples.
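
How multiple templates give a complex partitioning with a simple metric can be seen in a
minimal sketch. It is not the clustering algorithm of Chapter 7; it only shows labeling
against an already given set of templates. The phone names "P" and "Q" and the template
positions are hypothetical.

```python
import math


def nearest_phone(x, templates):
    """Label x with the phone owning the single closest template (Euclidean)."""
    best_phone, best_dist = None, math.inf
    for phone, centers in templates.items():
        for center in centers:
            d = math.dist(x, center)
            if d < best_dist:
                best_phone, best_dist = phone, d
    return best_phone


# Hypothetical phone "P" has two acoustically distinct manifestations,
# lying on either side of the single template for phone "Q".
templates = {"P": [[0.0, 0.0], [10.0, 0.0]], "Q": [[5.0, 0.0]]}
print(nearest_phone([9.0, 0.0], templates))  # P
print(nearest_phone([5.5, 0.0], templates))  # Q
print(nearest_phone([1.0, 0.0], templates))  # P
```

The region labeled "P" is disconnected, which a single template with the same metric could
not represent.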

9.3  Parametric  Representations 

The design or choice of a set of acoustic parameters for speech recognition analysis
is still a difficult problem. We have shown how accuracy is improved -- usually at the cost
of increased computation and memory requirements -- by choosing more informationally
complete representations. However, this is not the only source of computational costs.
Lower accuracy, in most systems, introduces larger data bases and more extensive
searches at higher levels. Thus a total system analysis of costs in memory and speed
requirements versus accuracy must be made by the system designer if the choice of
parametric representation is to be made with cost in mind.

At the present state of the art, emphasis has been placed mainly upon accuracy,
since that aspect of performance is the most critical to a number of the goals of speech
understanding systems, even to overall speed and memory. However, if systems are to be
designed for limited resources, low cost, and real-time operation, excessive parametric
information must be trimmed away. In another sense as well, parametric information
should be as sparse as is necessary to meet the system performance goals. This is in
order to reduce the likelihood that extraneous aspects of an input pattern will lead to
error†. A number of methods for selecting parts of the parametric pattern, according to a
priori decisions about speech class, are available, from sequential decision methods [Fu68]
to specific parameters designed for such classes [Wei75, Ata75].

† For example, in a situation where one is reasonably sure of a fricative sound, a lot of
information about resonant structure in the lower frequencies is worse than useless. It
may actually cause mislabeling to a high vowel if there is any voicing present (as there
often is).

Of the performance information reported in Chapter 8, the confusion matrices are
the most useful to a system designer concerned with special cases -- specific situations
where particular parametric representations may fail or succeed. An extensive analysis of
sub-matrices such as is found in Weinstein et al. [Wei75] should really be structured to
match the particular higher level knowledge and particular requirements of processes
found in each system. We cannot give specific recommendations of the type, "Use
parameters X for case A...," since the cases of interest are determined by the individual
systems.

The overall performance results do reflect actual, continuous speech recognition of
American English sentences. In that respect, they reflect the a priori distributions of
phonetic types, coarticulation situations, stress and pitch variations, etc., which are likely
to be encountered under similar conditions of speech. The results are, therefore, more
valid for prediction than if they had been compiled from artificial word lists or from a
smaller data corpus. In the light of such a belief, the results indicate the relative
effectiveness with which segmentation and labeling can be performed at the most primitive
level of recognition, averaged over a number of different situations. Since most
knowledge sources will build upon primitive decision mechanisms, we feel the comparative
results reported here will be valuable even for more sophisticated, phonetic level speech
recognition programs.

9.4  Parametric  Level  Knowledge  Sources 

In comparison with some of the reported work at the acoustic/phonetic level, our
segmentation and labeling routines may seem rather harshly limited to the parametric
representation level only. However, our view has been that other knowledge can be
applied by separate processes at separate times if the system structure is sufficiently
flexible. This is neither a new, nor an extremely insightful, point of view, but it does allow
us to focus on the set of recognition decisions which occur prior to any phonetic or
phonological analysis. It also is a logical extension of the concept of modularly
implemented separable sources of knowledge so often expressed in the literature
[New71, Red73, Erm74b, Les75, Woo75].

We suggest that a reasonable approach to analysis of speech at this level is to make
the transformation from parametric representation to acoustic/phonetic segments as soon
as possible and with as much information as is relevant. The machine transcription
occupies considerably less space than the complete parametric representation of the
signal. In addition, the kind of processing needed to create this transcription is
straightforward and easily performed in a parallel manner, perhaps with special purpose
machinery, or off-line, in cases where experiments are run before or during system
development.

We have shown that, by using simple pattern classification techniques, reasonable
labeling and good segmentation performance may be achieved. Using these simple pattern
space measures for many decisions yields the additional bonus of parameter independence.
The routines are not built to accommodate particular parameters, but rather designed to
make use of the information inherent in the occurring populations of entire pattern vectors.
Thus, the method of extracting parameters may be changed during system development, or
after, whenever better methods are found, and the routines may be expected to work well
without extensive re-tuning.

We are not, however, arguing against the use of more complex decision procedures,
nor against more feedback from higher level knowledge sources. Rather, we stress the
need to make as complete and valid use of the patterns of parameters as is possible, as
early (low level) as possible. This requires detailed knowledge of the statistical nature of
the pattern space, encoded in the trained templates, distance measures, and related
aspects of the pattern classification functions.

9.5  Evaluation 

Our approach to parametric level processing by simple pattern classification
requires that one disengage higher level prejudices and expectations from one's evaluation
of the accuracy of such methods. Our philosophy for performance evaluation has been to
expect what is really in the input to appear in the output transcription, and, additionally, to
expect absent what is not in the input. For example, if we consider only individual
parameter vectors extracted at a single short interval of time, we cannot hope to integrate
into our decisions segment-sequence phonetic information such as is available in stop
consonants. This is not to say that such complex patterns are not in the input as a whole,
just not in the particular input to the decision rule being evaluated. Rather, we suggest
that the right time to determine whether such information is preserved by the lower level
routines is when higher level knowledge sources are evaluated. To continue the example,
since a /t/ burst is acoustically similar to an /s/, we are somewhat satisfied if our labeler
returns /s/ for some of the /t/ bursts. It is the job of the phonetic/phonemic level
knowledge sources to discover /-//s/ sequences and label them /t/. Such cases will lead
to a lower overall accuracy score for our labeling evaluation, since we do not have such
knowledge encoded in our labeling referent. However, the confusion matrix entries of /t/
for /s/ and vice versa should be recognized as less critical by anyone investigating the
labeling accuracy of a particular parametric representation.

In view of the previous discussions, our performance measurements would appear to
be the lower bounds of performance to be expected from a particular representation. We
feel that such a lower bound is as valuable a measure as more optimistic estimates which
integrate the results of some higher level knowledge sources. Certainly, if the total
systems are to be modeled in terms of individual processes, such a "separation of power"
view is necessary.

The idea of modeling the entire system is particularly attractive but difficult to
accomplish. As knowledge source interaction has become more complex, our understanding
of the implications of errors at various levels has become less complete -- more derived
from the special cases actually traced. The signal detection model may provide a (very
broad) model of this lowest level of recognition activity -- less detailed than the confusion
matrix model of errors, but easier to manipulate. We can foresee the d' detectability
measure in use to parametrize a zeroth order simulation of segmenting and labeling. This
model might be improved by applying the conditional probabilities available in the
confusion matrix. As similar models are developed for other levels, overall knowledge
interaction schemes can be simulated.
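
A simulation of errorful labeling driven by confusion matrix probabilities might be
sketched as follows. The function name is hypothetical and the conditional probabilities
are invented, not taken from Chapter 8; segmentation errors (missing and extra segments)
are not modeled.

```python
import random


def simulate_labels(true_labels, confusion, rng):
    """Replace each true label by a label drawn from its confusion matrix row.

    confusion[t][o] is the conditional probability of output o given true label t.
    """
    simulated = []
    for t in true_labels:
        row = confusion[t]
        outputs = list(row)
        weights = [row[o] for o in outputs]
        simulated.append(rng.choices(outputs, weights=weights, k=1)[0])
    return simulated


# Invented conditional probabilities; each row sums to 1.
confusion = {
    "T": {"T": 0.6, "S": 0.4},
    "S": {"S": 0.7, "T": 0.3},
}
rng = random.Random(0)  # fixed seed so that a run can be repeated
print(simulate_labels(["T", "S", "T", "T"], confusion, rng))
```

The simulated label strings could then be fed to a higher level knowledge source in place
of real labeler output.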

9.6  Topics  for  Further  Research 

It  has  become  almost  obligatory  in  many  dissertations  such  as  this  one  to  include  a 
list  of  topics  for  further  research.  However,  we  would  like  to  include  such  a list  for  quite 
a different  reason  than  tradition.  In  the  course  of  these  investigations  into  parametric 
level  processing  of  speech  for  machine  recognition  and  understanding,  we  have  been  made 
aware of a number of interesting possibilities for extending or improving techniques for
segmenting  and  labeling  as  well  as  some  interesting  approaches  to  evaluating  performance 
and  the  problems  of  training  the  parametric  routines.  We  believe  that  a great  deal  of 
progress  may  be  made  in  these  areas,  and  have  had  some  difficulty  in  keeping  to  a 
particular  path  of  research  with  all  the  tempting  problems  surrounding  this  level. 
Moreover,  we  have  spent  some  effort  in  presenting  a view  of  the  current  state  of  the  art, 
and  that  view  will  not  be  complete  without  pointers  to  the  aspects  most  likely  to  yield 
further  progress. 

9.6.1 New Parametric Representations

The search for new and better parametric representations will, of course, continue.
Particular models of speech production or reception in humans, such as the all pole LPC
model, or Baker's LIP parameters† [BakJM75], will continue to provide new insights into
the kind of information and encodings found in human speech. Another direction yields
parameters and decision procedures designed to detect specific phonetic features [Ata75].
It is important to consider the decision procedure to be used with a particular parametric
representation, for the effective shape of the pattern space depends upon both.

† based upon neuropsychological evidence of zero-crossing responsive cells

Where does this lead for future parametric representations? Perhaps a more
integrated approach to their development will result from such considerations -- one in
which the needs of machine recognition of speech, and machine oriented higher level
features, play as strong a role as aspects of human perception have played in the past.
Certainly, any new parametric representation must be extensively tested and compared
with the already numerous available ones if we are to make real progress. In this, the
work reported here will serve as a valuable tool for guiding research.

9.6.2  Segmentation 

A number of different segmenters are currently being developed and tested, and it
is quite likely that features of many will prove particularly useful to others. The ideas
expressed in our segmenter for integrating boundary detections from a number of
functions of the signal may prove useful for integrating segment evidence from a variety
of sources of such knowledge. However, one problem which seems likely to yield to
immediate beneficial solution is that of adjusting detection thresholds (or whatever tuning
parameters are relevant to the particular segmenter in question) to the non-stationary
behavior of speech. Boundaries are characterized by a variety of durations, magnitudes,
and qualities of signal change. It may be necessary to extend the period of time over
which the signal is viewed, to adjust the thresholds to reject insignificant changes, or to
ignore entire regions of the pattern space, if they are independent of the phonetic
information in the signal.

A powerful line of attack is suggested by recent work in visual segmentation
[Ohl75]. In this approach, histograms of the parametric measurements are analyzed for
each scene, in order to determine the most likely parameters for segmentation as well as
the best thresholds for those parameters (for that scene).

In  speech  research,  a very  low  level  segmenter  has  been  added  to  the  Dragon 
system  to  improve  speed  of  recognition  [Low76].  Good  success  has  resulted  from  a single 
detection  parameter  which  tracks  the  change  over  a varying  time  period,  looking  further 
during  slow  changes  for  evidence  of  acoustic  boundaries. 

The message here is that a number of statistical pattern classification techniques
seem to be applicable to the segmentation problem. Dynamically adjusted, self-training
detection routines will result in very robust, high performance segmentation.
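
One very simple form of self-adjusting threshold can be sketched as follows. It is neither
the histogram method of [Ohl75] nor the Dragon segmenter of [Low76]: it assumes a single
change measure per frame and sets the threshold from the statistics of the utterance
itself (mean plus k standard deviations). The change values and the constant k are
invented.

```python
import statistics


def adaptive_threshold(changes, k=1.0):
    """Threshold derived from the utterance's own change statistics."""
    return statistics.mean(changes) + k * statistics.pstdev(changes)


def boundaries(changes, k=1.0):
    """Indices whose change measure exceeds the utterance-specific threshold."""
    threshold = adaptive_threshold(changes, k)
    return [i for i, c in enumerate(changes) if c > threshold]


# Invented frame-to-frame change measures with one clear boundary at index 2.
quiet = [0.1, 0.1, 0.9, 0.1, 0.1, 0.1]
loud = [10 * c for c in quiet]  # same utterance at ten times the signal level
print(boundaries(quiet))  # [2]
print(boundaries(loud))   # [2]
```

A fixed threshold of, say, 0.5 would mark every frame of the louder version as a boundary,
whereas the adjusted threshold finds the same single boundary in both.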

9.6.3  Recognition  Targets 

In a very similar sense as segmentation, better training of recognition templates will,
undoubtedly, result from dynamic tracking of input data in the manner discussed by a
number of pattern classification researchers. We wish to point out another aspect of the
target problem which needs attention at this time -- the integration of acoustic and higher
level knowledge about speech in the recognition target set. This problem has been
extensively discussed by others [Wei75] and efforts have been made to construct, a priori,
sets of phonetic labels which are acoustically distinct. We feel that such sets must be
discovered in much the same way as other aspects of the pattern space, by statistical
analysis of bodies of data. Clearly, there are many improvements to be made to the simple
clustering algorithm of Chapter 7. We look forward to more positive results from
data-derived recognition targets, and, in addition, from data-derived higher level rules
[Smi75, Hay75].

9.6.4 Evaluation

With regard to evaluation, there is so much to be done that we will limit our
discussion to one important aspect of evaluating accuracy performance at the parametric
level. That is the problem of acquiring high-fidelity referents to which recognition results
may be compared. Since the performance of knowledge sources at one level of speech
may not be expected to match the expectations of another level, performance evaluations
will be in error unless the level of description (of the signal's contents) of the referent
and the recognition are quite similar.

An obvious procedure is to take great care in hand producing the referent
transcriptions. Not only is this extremely laborious†, but it also fails because the human
transcriber may not understand just what the expectations and implications are of a
knowledge source encoded in rules, programs, or trained statistics. It would appear that
the best source of these expectations is the speech understanding system itself. The time
has come for systems to be designed and implemented with evaluation of various
knowledge sources as a basic facility. At each level of representation, facilities should be
available to derive, from the knowledge sources, just what inputs from other sources
would result in the correct action, decision, etc. As an example, if a higher level of the
system can recognize the transition segment /I/ in lower vowels following /g/ or /k/ as an
indication of those stops, such cases in the referent should include /stop//I//vowel/ as an
alternative to /g,k//vowel/.

† which leads to errors and limits the quantities of data that may be analyzed

It  may  well  be  time  to  depart  from  the  close  ties  to  human  perceptual  experience 
with  speech.  Some  of  the  most  successful  systems  to  date,  both  for  word  recognition  and 
connected speech understanding, [Ita75, BakJK75b, Erm74b] have had much less in common
with  what  we  know  about  humans  and  linguistics  than  with  what  we  know  about  computers 
and  artificial  intelligence  techniques.  It  is  less  important  to  model  human  processing  than 
to  match  human  competence;  especially  since  we  know  so  little  about  the  elements  of  the 
human  speech  perception  mechanism.  To  this  end,  the  best  use  must  be  made  of  quite 
different  information  processing  devices  than  humans  seem  to  have,  and  of  different  forms 
of  data  and  control. 

9.7  Envoy 

This thesis has involved investigations of a number of design choices for the
acoustic/parametric level of computer speech recognition. It has led us to survey a large
range of techniques, and to attempt to extract aspects of Pattern Classification, Acoustic
Analysis, and Performance Evaluation most relevant to the stated goal -- a comparison of
segmentation and labeling performance. The current efforts to develop speech
understanding systems are producing in their wakes a number of theories about speech.
Although overall performance of a total system is an important measure of the validity of
its assumptions, the difficulty of studying each system's components, in vitro, has been a
handicap to the extension of our understanding of the entire problem. This work is an
attempt to extract one basic component and evaluate it in a manner which will both aid
designers and increase understanding of its role in the total speech recognition problem.

S1: Segmentation -- Some Cases

The following are some cases where the hand and machine segmentations disagree.
They are classified according to type of error ((M)issing or e(X)tra) and degree (0 - machine
correct, 1 - not critical, 2 - critical error). We introduce these cases to illustrate the various
phenomena which are involved in segmentation, and which must be considered in
evaluating segmentation. Two displays are given for each case: a plot of the digitized
waveform -- 10 kHz, 9 bits -- and a plot of the SPG parameters (which serves well as a
digital spectrogram).

M2 -- Cases where a critical segment boundary is not detected:

/EH//L/, slow change in sonorants not detected

/-//G/, voice bar not detected

M0 -- Cases where hand segmentation is not correct:

/EL/, no separate vowel segment

/AO/, utterance-final, no separate /L/

/NG/, nasal to voice bar, /G/ probably deleted

X1 -- Cases where machine boundary is incorrectly included:

/B/, voice bar lost (SPG amplitude parameter insensitive)

X0 -- Cases where hand boundary should be indicated:

/K//K/, burst and aspiration separated

/Z//S/, voiced transition segment detected

/AW//AW/, vowel onset transition segment

S2: Segmentation -- Hand Corrected Machine Segmentation

In the following waveform plots, the results of running the segmenter with SPG
input parameters are shown. Ratings of all the points of disagreement with the referent
segmentation are given. In this plot, the referent segmentation is given below the
waveform in two lines, a phonemic and a sub-phonemic transcription. The ratings are the
same types as indicated in Appendix S1.

L1: Labeling Evaluations

The following are evaluations of labeling for a number of the parametric
representations, distance metrics, and labeling target sets investigated. The entire design
space for labeling could not be covered in this appendix. The first table in each case
contains two different acceptance criteria for templates: position (POS) and relative
distance (RELDST). This latter is the difference between the score of the best template
and that of the template in question. For either criterion, the target class accuracy
(CL-SCOR) and the target class branching factor (CL-POS and BRNCH) are given for a range of
acceptance values. The second display is the confusion matrix for the evaluation run.
Entries  are  conditional  probabilities  « 100. 
[Confusion matrix for an evaluation run. The table heading and the individual entries are not recoverable from the scanned source.]
BF vs. Accuracy and Confusion Matrix
Number of clusters: 76
Distance Metric: EUC
Parametric Representation: ASA
Data Files: TAP (speaker - CC, task - News)

POS  CL-POS  CL-SCOR  RELDST  CL-SCOR  BRNCH

[Accuracy table and confusion matrix. The numeric entries are not recoverable from the scanned source.]
BF vs. Accuracy and Confusion Matrix
Number of clusters: 76
Distance Metric: COR
Parametric Representation: [illegible]
Data Files: TAP (speaker - CC, task - News)

POS  CL-POS  CL-SCOR  RELDST  CL-SCOR  BRNCH

[Accuracy table and confusion matrix. The numeric entries are not recoverable from the scanned source.]
BF vs. Accuracy and Confusion Matrix
Number of clusters: 76
Distance Metric: [illegible]
Parametric Representation: [illegible]
Data Files: TAP (speaker - CC, task - News)

POS  CL-POS  CL-SCOR  RELDST  CL-SCOR  BRNCH

[Accuracy table and confusion matrix. The numeric entries are not recoverable from the scanned source.]
BF vs. Accuracy and Confusion Matrix
Number of clusters: 76
Distance Metric: SIG
Parametric Representation: ASA
Data Files: TAP (speaker - CC, task - News)

POS  CL-POS  CL-SCOR  RELDST  CL-SCOR  BRNCH

[Accuracy table and confusion matrix. The numeric entries are not recoverable from the scanned source.]
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L2:  A Machine  Transcription 

The following is a transcription of some of the sentences in the TAP data set
(speaker CC, task News Retrieval). The transcription is the result of segmenting and then
labeling with the routines described in this dissertation. The input parameters were the
SPG spectrograms, and the labeling was done with the SIG metric and a set of templates
very similar to the ones used in the labeling evaluations. (Some hand correction of the
template names was done by adding phonetic modifiers to the template labels.) The first
column indicates the time in centi-seconds at which either the hand or machine
transcription changes. The second column is the hand transcription. The remaining
columns indicate, in order of increasing distance (decreasing rating score), the recognized
templates. The best score is 50; it is arbitrarily assigned to silences and flaps, which
are detected by the segmenter. The ARPABET uppercase phonetic symbols have been
used throughout this work.
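The row layout described above can be sketched as follows. This is an illustrative reconstruction, not the listing program used for this appendix. It assumes each transcription is a list of segments given by a start time in centi-seconds, that a row is emitted at every time at which either transcription changes, and that each row repeats the hand label and the machine candidates active at that time. The function name and the sample labels and scores are hypothetical.

```python
from bisect import bisect_right


def merge_transcriptions(hand, machine):
    """Merge a hand and a machine transcription into table rows.

    hand:    list of (start_cs, label), sorted by start time.
    machine: list of (start_cs, [(template, score), ...]), sorted by start time.
    Returns a list of (time_cs, hand_label, candidates), where candidates
    are ordered by decreasing rating score (increasing distance).
    """
    hand_starts = [start for start, _ in hand]
    machine_starts = [start for start, _ in machine]
    rows = []
    # One row for every time at which either transcription changes.
    for t in sorted(set(hand_starts) | set(machine_starts)):
        h = bisect_right(hand_starts, t) - 1      # index of active hand segment
        m = bisect_right(machine_starts, t) - 1   # index of active machine segment
        hand_label = hand[h][1] if h >= 0 else None
        candidates = machine[m][1] if m >= 0 else []
        ranked = sorted(candidates, key=lambda pair: pair[1], reverse=True)
        rows.append((t, hand_label, ranked))
    return rows


# Hypothetical example: "-" marks silence, which receives the best score of 50.
HAND = [(0, "-"), (12, "S"), (25, "IH")]
MACHINE = [
    (0, [("-", 50)]),
    (14, [("Z", 31), ("S", 38)]),
    (25, [("IH", 40), ("EH", 36)]),
]

for time_cs, label, ranked in merge_transcriptions(HAND, MACHINE):
    print(time_cs, label, " ".join(f"{name}({score})" for name, score in ranked))
```

Rows at times 12 and 14 in the example show a boundary disagreement: the hand transcription changes two centi-seconds before the machine segmentation does.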
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